Embedding many large text documents.

I am embedding large noisy text docs (websites) using UMAP for further downstream tasks, including classification (site category), regression (advertising performance) and clustering.

Sometimes the embedding works well and other times less so! I still don't understand how to tune the process to get a good embedding. The idea is to have a single embedding that can be used across the various tasks (if possible).

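The main issue I am having is that there is often not a huge amount of separation in the data. Often the result is as below:

[image: image] https://user-images.githubusercontent.com/10243849/90162867-322a9480-dd8d-11ea-9d21-215816a99a67.png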

Where we have one large group and then a few well-defined smaller groupings. This can be explained by the fact that lots of internet / news content is quite generic, but I am still looking for a better way to separate articles, say when both are news but one is financial news and the other is about climate.

How does UMAP deal with more data? For example, is one global embedding learnt on millions of varied sites a good idea, or am I better off splitting this up into per-category embeddings?

What hyperparams would be important in such a task?

One more piece of useful info: I am applying TF-IDF to my raw docs before passing the data to UMAP.

Sorry about the long rant and many questions 👍

EDIT: just one more Q, is it ever a good idea to scale the output of UMAP?

Hi there,

My longer-term and probably less useful answer is that we're currently working through the details and theory of how best to apply UMAP-style embeddings to text documents. It can be useful to remove the global effects of words from your representation. TF-IDF does a decent job of downweighting these effects, but it very much looks like you need to apply stronger transformers to remove the global (or at least broad) language effects that are tying together your document space. Our preliminary work, in case you are interested, can be found here: https://github.com/TutteInstitute/TextMAP. We don't have much documentation or a completed paper ready yet, but the building blocks for various pre-UMAP transformers can be found there if you are feeling very keen.

That said, I think the crux of your problem is that you'd like to incorporate some knowledge of your labels into your unsupervised embedding. Have you tried using the semi-supervised aspects of UMAP to learn an embedding that better separates your site categories or similar labels?
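
For anyone following along, here is a minimal sketch of what the semi-supervised suggestion could look like. It assumes umap-learn is installed and that site categories are available as integer ids for at least some documents. The helper names (mask_labels, semi_supervised_embed), the cosine metric and the toy category list are illustrative assumptions, not something from this thread. umap-learn treats a target value of -1 as "unlabelled", so documents with an unknown category can stay in the fit.

```python
import random


def mask_labels(labels, keep_fraction, seed=0):
    """Keep roughly keep_fraction of the labels; replace the rest with -1.

    umap-learn treats a target value of -1 as "unlabelled".
    """
    rng = random.Random(seed)
    return [lab if rng.random() < keep_fraction else -1 for lab in labels]


def semi_supervised_embed(X, partial_labels, n_neighbors=15):
    """Fit UMAP on X, using labels where known (-1 marks unknown)."""
    # Imported lazily so mask_labels can be used without umap-learn installed.
    import numpy as np
    import umap

    # metric="cosine" is an assumption for TF-IDF-like input, not a recommendation.
    reducer = umap.UMAP(n_neighbors=n_neighbors, metric="cosine", random_state=42)
    return reducer.fit_transform(X, y=np.asarray(partial_labels))


# Hypothetical integer category ids for six documents.
categories = [0, 0, 1, 1, 2, 2]
partial = mask_labels(categories, keep_fraction=0.5)
```

Calling semi_supervised_embed(X, partial) on a document-term matrix X with one row per document would then return the embedding. Whether this helps will depend on how trustworthy the available labels are.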

Hey @jc-healy

Thanks very much for your reply! I had a good read over the code and the notebooks for TextMAP, it looks very interesting.

I have even used the InformationWeightTransformer followed by RemoveEffectsTransformer to create a new plot. The results seem positive: the large "glob" is now being broken up in the middle a little, which is great.

[image: image] https://user-images.githubusercontent.com/10243849/90640642-2248ff80-e228-11ea-922d-0c0142066fb7.png

Are there any further hyperparams one could/should tweak in the UMAP step itself to create a little more distance between the inner clusters?

To answer your question, the categories themselves are predicted - I did not want to add those in this part of the pipeline. But I can give this a try.

Glad that I could help and that our recent work has been helpful for you. Nice plot. The easiest way to break apart a cluster like that would be to reduce the n_neighbors parameter in your UMAP. The concern there is that it might break apart some of those peripheral clusters on the right as well.
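
As a rough intuition for why fewer neighbours breaks things apart, here is a toy sketch. This is not UMAP itself: it only counts connected components of a plain k-nearest-neighbour graph on made-up 1-D points, and the function name knn_components is hypothetical.

```python
def knn_components(points, k):
    """Count connected components of the symmetrised k-nearest-neighbour graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Indices of the other points, nearest first (1-D distance for simplicity).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: abs(points[i] - points[j]),
        )
        for j in others[:k]:
            # An edge in either direction joins the two points.
            parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})


# Two made-up clumps of four points each.
points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
small_k = knn_components(points, k=2)  # every neighbour lies inside its own clump
large_k = knn_components(points, k=4)  # each point must reach into the other clump
print(small_k, large_k)
```

With k=2 the two clumps stay separate, while with k=4 every point is forced to pick a neighbour from the other clump and the graph becomes one piece. The same pressure is what can split the peripheral clusters when the neighbourhood gets small.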

If you really want to shatter your space, you might look into reducing the set_op_mix_ratio parameter away from 1 and towards 0. Intuitively, that will require both points to consider each other close before UMAP will consider them to be close. This is actually a soft dial shifting the way we symmetrize the nearest neighbour graph from union to intersection. As a standard warning, this can often destroy the more global structure in your data by cutting your space up into completely disconnected components.
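
To make the union-to-intersection dial concrete, here is a small pure-Python sketch of the blend. It assumes directed edge weights in [0, 1] and uses the usual fuzzy-set forms (union a + b - ab, intersection ab). The function name symmetrize and the toy weights are made up for illustration.

```python
def symmetrize(weights, set_op_mix_ratio=1.0):
    """Blend fuzzy union and fuzzy intersection of a directed weight matrix.

    weights[i][j] is how strongly point i considers point j a neighbour.
    A ratio of 1.0 gives pure union, 0.0 gives pure intersection.
    """
    n = len(weights)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = weights[i][j], weights[j][i]
            union = a + b - a * b
            intersection = a * b
            out[i][j] = (
                set_op_mix_ratio * union
                + (1.0 - set_op_mix_ratio) * intersection
            )
    return out


# Toy graph: 0 and 1 are mutual neighbours; 0 -> 2 is one-sided.
directed = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
as_union = symmetrize(directed, set_op_mix_ratio=1.0)
as_intersection = symmetrize(directed, set_op_mix_ratio=0.0)
```

The mutual edge between 0 and 1 survives at either extreme, while the one-sided edge from 0 to 2 keeps its weight under union and vanishes under intersection. In umap-learn the dial is the set_op_mix_ratio argument of umap.UMAP, which defaults to 1.0.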

lmcinnes / umap

Embedding many large text documents. #478