pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

combining TEXT.build_vocab with flair embeddings #650

Closed antgr closed 4 years ago

antgr commented 4 years ago

❓ Questions and Help

Description

Hi, we can use GloVe embeddings when building the vocabulary, using something like:

MIN_FREQ = 2

TEXT.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.300d",
                 unk_init = torch.Tensor.normal_)

We can also create embeddings using the flair library, for example:

from typing import List

from flair.embeddings import (
    TokenEmbeddings,
    WordEmbeddings,
    CharacterEmbeddings,
    FlairEmbeddings,
    ELMoEmbeddings,
    BertEmbeddings,
    StackedEmbeddings,
)

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),

    # comment in this line to use character embeddings
    # CharacterEmbeddings(),

    # comment in these lines to use flair embeddings
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
    ELMoEmbeddings(),
    BertEmbeddings('bert-base-uncased'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

Could I use the above embeddings instead of GloVe in the first snippet? Is anything similar to this supported?

zhangguanheng66 commented 4 years ago

@mttk I don't think we are currently supporting this one, right?

mttk commented 4 years ago

We don't support something like this out of the box, as flair has its own API for tokenization / data loading, and there is no simple way to integrate it with torchtext.

If you really want to (e.g. for using a specific dataset), you can do an ugly workaround like:

  1. initializing a Dataset in torchtext without assigning a vectors object to your Field;
  2. constructing a fake flair Sentence object from the Field's vocabulary (e.g. sentence = Sentence(' '.join(TEXT.vocab.itos)));
  3. fetching the embeddings from the StackedEmbeddings object (stacked_embeddings.embed(sentence));
  4. stacking the resulting embeddings into a tensor and passing it to nn.Embedding.

The embeddings will be fetched in order, so this is equivalent to constructing them via torchtext. Note that I'm not super familiar with flair, so there might be a more efficient way of doing this.
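Roughly, a minimal sketch of that workaround could look like the following (TEXT and stacked_embeddings are the objects from the snippets above; this assumes every vocabulary entry is a single whitespace-free token, so specials like <unk>/<pad> may need separate handling):

import torch
from flair.data import Sentence

# step 2: one "sentence" containing every vocabulary token, in itos order
sentence = Sentence(' '.join(TEXT.vocab.itos))

# step 3: run the stacked embeddings over it
stacked_embeddings.embed(sentence)

# step 4: one row per token, in the same order as TEXT.vocab.itos
vectors = torch.stack([token.embedding for token in sentence])  # (vocab_size, emb_dim)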

antgr commented 4 years ago

So, what you are saying is to construct a gigantic sentence with all the words of the vocab, pass this sentence as the argument to stacked_embeddings.embed(sentence), iterate through the tokens of the result, stacking them into a PyTorch tensor (3), and then use this tensor for nn.Embedding. Am I right?

mttk commented 4 years ago

Correct. It is ugly, but it will work and I don't see a cleaner way to do it (again, I'm not an expert with flair).

antgr commented 4 years ago

Sorry for the silly question: I am a bit confused by the last step, on how we use this tensor with nn.Embedding. :) Thank you so much for your answer!

mttk commented 4 years ago

There are two options, both equally good.

Option 1: copy the data from a tensor into an nn.Embedding layer:

import torch.nn as nn

vectors = ...  # the (num_tokens, embedding_size) tensor built above, or None
embedding = nn.Embedding(num_tokens, embedding_size, padding_idx=0)
if vectors is not None:
    embedding.weight.data.copy_(vectors)
if freeze_vectors:
    embedding.weight.requires_grad = False

Option 2: use from_pretrained:

embedding = nn.Embedding.from_pretrained(vectors)
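One note on Option 2 (this comes from general PyTorch behaviour, not from the thread): nn.Embedding.from_pretrained freezes the weights by default, so if you want to fine-tune the embeddings you would pass freeze=False:

embedding = nn.Embedding.from_pretrained(vectors, freeze=False)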

antgr commented 4 years ago

Great! Thank you!

mttk commented 4 years ago

No problem -- please do close the issue if everything is resolved.

Edresson commented 4 years ago

@mttk do you know if there is currently a more efficient way to do this?

Since flair embeddings are contextual embeddings, wouldn't that be a problem?

When we pass the join of TEXT.vocab.itos (sentence = Sentence(' '.join(TEXT.vocab.itos))) to flair, aren't we losing context?

Example of the join on the IMDB dataset: " the , . a and of to is in I it that " 's this - /><br was as with movie for film The but n't ( ) on you are not have his be he one at by all ! an who they from like so her or ' has about It out just do ?"

Thank you in advance for the answer :)

mttk commented 4 years ago

@Edresson the example code would work only for non-contextualized embeddings, in case you want to obtain the embedding-matrix vectors. It would of course fail when applied to a concrete instance.

If you want to use flair contextual embeddings, I believe you need to load and use the dataset with their pipeline. There might be a workaround, but I'm not familiar with one at the moment as I don't actively use flair.
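For reference, a minimal sketch of that route using flair's own pipeline (model names taken from the earlier snippet; the sentence is just an illustrative example): you embed the concrete sentence itself, so the contextual models see real context rather than a vocabulary dump.

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# contextual embeddings need the actual sentence, not the joined vocabulary
stacked = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

sentence = Sentence('This movie was surprisingly good .')
stacked.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)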