google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

How to get the embedding vector or matrix after pre-training #14

Open pidahbus opened 4 years ago

pidahbus commented 4 years ago

Hi, following the provided commands, I pre-trained ELECTRA-Small on my dataset. After pre-training, I want the learned embeddings so that I can use them in some other, more involved downstream tasks. Could you please help me with how to extract the word embeddings after pre-training?

stefan-it commented 4 years ago

Hi @pidahbus ,

this is not a complete solution for your issue, but you can have a look at this script from @LysandreJik, which retrieves embeddings from the original TF model:

https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed

You just need to replace the input data with real ids from the BERT vocab, including [CLS] and [SEP] as the surrounding special tokens, and then you can pass this id sequence into the model.
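
For concreteness, here is a minimal sketch of building such an id sequence from a BERT-style vocab.txt (one token per line); the file name and tokens are placeholders, and the tokens are assumed to already come out of a WordPiece tokenizer:

```python
# Sketch: convert WordPiece tokens to ids and add the special tokens.
def load_vocab(vocab_file):
    with open(vocab_file, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

vocab = load_vocab("vocab.txt")  # the vocab used during pre-training

tokens = ["the", "quick", "brown", "fox"]  # placeholder WordPiece output
input_ids = (
    [vocab["[CLS]"]]
    + [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    + [vocab["[SEP]"]]
)
print(input_ids)  # this id sequence can then be fed into the model
```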

(The aim of the script is to check the difference between the original TF model from the official ELECTRA implementation and the upcoming PyTorch version in Hugging Face's Transformers library.)
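
For reference, once an ELECTRA checkpoint is available in Transformers, contextual embeddings can be retrieved roughly like this (a sketch using the public google/electra-small-discriminator checkpoint, not the custom one from this thread):

```python
import torch
from transformers import ElectraModel, ElectraTokenizer

model_id = "google/electra-small-discriminator"  # public example checkpoint
tokenizer = ElectraTokenizer.from_pretrained(model_id)
model = ElectraModel.from_pretrained(model_id)

# The tokenizer adds [CLS]/[SEP] automatically.
inputs = tokenizer("the quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```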

pidahbus commented 4 years ago

Hi @stefan-it, I went through the code and made the necessary changes to the .py files to extract the generator/discriminator embeddings. Would you like me to send you a pull request for this?
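
A minimal sketch of one way to read an embedding table straight out of the TF checkpoint, without rebuilding the model (the checkpoint path is a placeholder, and the variable names are assumptions based on ELECTRA's BERT-style scopes, "electra/..." for the discriminator and "generator/..." for the generator; verify them with tf.train.list_variables):

```python
import numpy as np
import tensorflow as tf

ckpt = "models/electra_small/model.ckpt-1000000"  # placeholder path

# List the checkpoint variables to locate the embedding tables.
for name, shape in tf.train.list_variables(ckpt):
    if "embeddings" in name:
        print(name, shape)

# Read the token embedding matrix directly from the checkpoint.
reader = tf.train.load_checkpoint(ckpt)
word_embeddings = reader.get_tensor("electra/embeddings/word_embeddings")
print(word_embeddings.shape)  # (vocab_size, embedding_size)
np.save("word_embeddings.npy", word_embeddings)
```

Note that, per the ELECTRA paper, the generator and discriminator share their token embeddings, so both may point at the same table.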

yebingxue commented 4 years ago

Hi @pidahbus, could you share your changes for extracting the generator/discriminator embeddings on your GitHub? Thanks!

cirisjl commented 3 years ago

Hi @pidahbus! Could you please send me the pull request for extracting the generator/discriminator embeddings? Thanks!