Closed seanmacavaney closed 4 years ago
Hi Sean,
Thanks for your interest in the library and models :)
The two files can be created from a .bin fasttext file with the following helper scripts: https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_fasttext_vocab_mapping.py & https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_fasttext_weights.py
(If "token_embedder_type" in the config is not set to fasttext, the paths are ignored.)
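For intuition, here is a minimal sketch of what the vocab-mapping/weights step boils down to: map each token to an id and stack the corresponding vectors into one aligned matrix. This is not the repo's actual script; the toy `vectors` dict stands in for a loaded fastText .bin model (with the real `fasttext` library you would query `model.get_word_vector(token)`), and the `@@PADDING@@`/`@@UNKNOWN@@` reserved tokens follow AllenNLP's defaults.

```python
import numpy as np

def build_vocab_and_weights(vectors, dim):
    """Build a token -> id mapping and an id-aligned weight matrix.

    `vectors` is a stand-in for a fastText model (token -> vector).
    Ids 0 and 1 are reserved for padding and OOV, matching the
    AllenNLP default special tokens.
    """
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    rows = [np.zeros(dim, dtype=np.float32),
            np.zeros(dim, dtype=np.float32)]  # padding / OOV rows
    for token, vec in sorted(vectors.items()):
        vocab[token] = len(vocab)           # next free id
        rows.append(np.asarray(vec, dtype=np.float32))
    return vocab, np.stack(rows)            # shape: (len(vocab), dim)

# Toy vectors instead of a real .bin model:
toy = {"rank": [0.1, 0.2], "query": [0.3, 0.4]}
vocab, weights = build_vocab_and_weights(toy, dim=2)
print(vocab["query"], weights.shape)  # 2 (4, 2)
```

The point is simply that row `vocab[token]` of the weight matrix must hold that token's embedding, so the model can load it as a fixed embedding layer.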
That said, I would not recommend using fastText: we only saw improvements for non-transformer models, and as soon as we applied transformers to the embedded sequences the embedding method did not matter much (so we stayed with GloVe for TK and TKL). fastText also makes the embedding process more complex.
Best, Sebastian
Ah, gotcha. Thanks for the info! Where can I find/make the files for the GloVe embeddings, then [1]? From what I can tell from the code, they are built with AllenNLP?
You can generate collection-specific vocabularies with: https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_vocab.py or, for convenience, I just pushed the GloVe 42B vocabulary to the matchmaker/vocabs folder.
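For reference, a collection-specific vocabulary is conceptually just a frequency count over the collection's tokens with a cutoff. The sketch below is a hypothetical simplification of that idea, not the repo's generate_vocab.py (which runs over the actual MS MARCO files and uses the project's tokenization); `min_frequency` and whitespace splitting are assumptions for illustration.

```python
from collections import Counter

def generate_vocab(lines, min_frequency=1):
    """Count whitespace tokens across a collection and keep those
    occurring at least `min_frequency` times, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [tok for tok, c in counts.most_common() if c >= min_frequency]

docs = ["the quick fox", "the lazy dog", "the fox"]
print(generate_vocab(docs, min_frequency=2))  # ['the', 'fox']
```

Restricting the vocabulary this way keeps the embedding matrix small and avoids loading GloVe rows for tokens that never appear in the collection.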
Worked like a charm!
Some additional documentation about the source of the embedding files would be helpful for getting it running. For instance, what are these two fastText files?
https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/configs/datasets/tr-msmarco-passage.yaml#L134