Word2Vec and Doc2Vec support

x-tabdeveloping commented 1 year ago

Hello We had a talk over at another issue on sklego about potentially including Word2Vec and Doc2Vec support in embetter. We already have a lot of code and went through a lot of considerations about how this could or should be done with a colleague at the Center for Humanities computing. This repo contains most of what we cooked up, but here are some considerations that guided our choices and some of the compromises we made. I'm interested to hear your opinion @koaning, cause I would be willing to join forces and implement this in embetter.

Here is how we use word2vec and doc2vec for the most part:

We train models so we can capture particular relations in a relatively small corpus. In these instances we usually have to do extensive cleaning, lemmatization and such.
We train models on large datasets, where a lot of streaming and quality filtering has to be done.

The fundamental problem is that there is no canonical implementation of sentencization or tokenization in gensim for these models, so you somehow have to do these steps manually. So we figured that introducing some components that can do this for us would be useful. We started out with implementing a SpacyPreprocessor component, that would only let certain patterns of tokens pass, and would lemmatize and sentencize if we want it to. I also implemented a dummy version of this. As far as I know this is also in certain ways similar to what you want to achieve with TokenWiser. Now one consideration that I was particularly thinking a lot about and it's still haunting me and I'm not sure how many iterations we have to go through before we find the right solution is how to preserve the inherent hierarchical structure of the data throughout the pipeline. Namely:

documents
- sentences
  - tokens We settled on a solution where the preprocessor component returns a nested iterable (currently a list but as I'm writing this I'm thinking about using an Awkward Array instead.).

One could think that this should be delegated to some preprocessing step outside the pipeline, but I would argue that having it in the pipeline prevents a lot of errors in production. Let's say you want to train a word embedding model only on lemmas. If you do not include the lemmatization as part of the pipeline, then you have to replicate the lemmatization behavior in production too not just in the training script.

We also have Word2Vec and Doc2Vec transformer/vectorizer objects, that take these ragged structures and turn them into embeddings. transform() with Word2Vec for example also returns a ragged Awkward Array with the same hiearchical structure as the documents themselves. This is great because it allows you to use the individual words or sentences downstream if you want to. We also included wrangler components, that can flatten/pool these structures. Here's how for example a Word2Vec-average encoding pipeline looks in our emerging framework.

import spacy
from skpartial.pipeline import make_partial_pipeline
from skword2vec.wranglers import ArrayFlattener, Pooler
from skword2vec.preprocessing.spacy import SpacyPreprocessor
from skword2vec.models.word2vec import Word2VecVectorizer

nlp = spacy.load("en_core_web_sm")
preprocessor = SpacyPreprocessor(nlp, sentencize=True, out_attribute="LEMMA")
embedding_model = Word2VecVectorizer(n_components=100, algorithm="sg")

embedding_pipeline = make_partial_pipeline(
  preprocessor,
  embedding_model,
  # Here we need to flatten out sentences
  ArrayFlattener(),
  # Then pool all embeddings in a document
  # mean is the default
  Pooler(),
)

Now I know this is vastly different from how most encoders work in embetter, but nonetheless I wanted to put this out here to start a discussion about how you imagine these would work in embetter. I am flexible and open to suggestions and compromises and ready to implement if need be :).

koaning commented 1 year ago

Ah. I think there's a difference between what you're trying to accomplish and what this library does. The goal of embetter is to make it easy to re-use pre-existing pre-trained embeddings in scikit-learn and to (maybe) fine-tune them.

Your library seems to focus more on training embeddings. Which feels out of scope. My hope is that the finetuning components may compensate for that use-case. I have been toying around with featherbed to train custom embeddings with a "lightweight trick", but I've personally found it hard to train embeddings locally that are better than what other libraries offer pre-trained already. Not just in terms of "cosine-distance-metrics" but also in terms of "inference speed".

As far as I know this is also in certain ways similar to what you want to achieve with TokenWiser.

About that. Part of me regrets creating tokenwiser. It would have been better if I tried the ideas in some experiments before making implementations in a pip-installable package. In hindsight, the ideas didn't work that well and the implementations were pretty slow. You can still download it, use it, but I've stopped maintaining it for a while now. The useful ideas, like the partial pipeline, have moved into separate packages.

Word2Vec and Doc2Vec support

If there's a clear use-cae for add support for these kinds of models, maybe via something like gensim, then this is certainly something we might still discuss.

x-tabdeveloping commented 1 year ago

Yeah, okay, makes perfect sense. We do need to train the embeddings ourselves as we usually use them to capture implicit semantic relations in the corpora we study. Also we work a lot with Danish, for which there isn't an abundance of great embeddings and it's something the center wants to develop.

I thought tokenwiser was an interesting project, but I definitely understand how it must have been a bit difficult to structure sensibly. I think we will still keep trying to develop some tokenization utilities for model training.

As far as then things in embetter go, I have quite a bit of experience working with gensim's embedding models, if you think Doc2Vec and Word2Vec would be worth having in embetter I can certainly contribute to that. I think it would also be beneficial for our work to have easily employable sklearn compatible components for using our pretrained embeddings.

koaning commented 1 year ago

This issue was fixed by this PR: https://github.com/koaning/embetter/pull/76

koaning / embetter

Word2Vec and Doc2Vec support #75