Closed seanmacavaney closed 4 years ago
Hi Sean,
Thanks for your interest in the library and models :)
The two files can be created from a .bin fasttext file with the following helper scripts: https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_fasttext_vocab_mapping.py & https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_fasttext_weights.py
(If "token_embedder_type" in the config is not set to fasttext, the paths are ignored.)
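For intuition, here is a minimal sketch of what the vocab-mapping/weights step boils down to: map each token to an id and stack the corresponding vectors into one aligned matrix. This is not the repo's actual script; the toy `vectors` dict stands in for a loaded fastText .bin model (with the real `fasttext` library you would query `model.get_word_vector(token)`), and the `@@PADDING@@`/`@@UNKNOWN@@` reserved tokens follow AllenNLP's defaults.

```python
import numpy as np

def build_vocab_and_weights(vectors, dim):
    """Build a token -> id mapping and an id-aligned weight matrix.

    `vectors` is a stand-in for a fastText model (token -> vector).
    Ids 0 and 1 are reserved for padding and OOV, matching the
    AllenNLP default special tokens.
    """
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    rows = [np.zeros(dim, dtype=np.float32),
            np.zeros(dim, dtype=np.float32)]  # padding / OOV rows
    for token, vec in sorted(vectors.items()):
        vocab[token] = len(vocab)           # next free id
        rows.append(np.asarray(vec, dtype=np.float32))
    return vocab, np.stack(rows)            # shape: (len(vocab), dim)

# Toy vectors instead of a real .bin model:
toy = {"rank": [0.1, 0.2], "query": [0.3, 0.4]}
vocab, weights = build_vocab_and_weights(toy, dim=2)
print(vocab["query"], weights.shape)  # 2 (4, 2)
```

The point is simply that row `vocab[token]` of the weight matrix must hold that token's embedding, so the model can load it as a fixed embedding layer.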
That said, I would not recommend using fastText: we only saw improvements for non-transformer models, and as soon as we applied transformers to the embedded sequences the embedding method did not matter much (so we stayed with GloVe for TK and TKL). fastText also makes the embedding process more complex.
Best, Sebastian
Ah, gotcha. Thanks for the info! Where can I find/make the files for the GloVe embeddings, then [1]? From what I can tell from the code, they are built with AllenNLP?
You can generate collection-specific vocabularies with: https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/preprocessing/generate_vocab.py or, for convenience, I just pushed the GloVe 42B vocabulary to the matchmaker/vocabs folder.
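For reference, a collection-specific vocabulary is conceptually just a frequency count over the collection's tokens with a cutoff. The sketch below is a hypothetical simplification of that idea, not the repo's generate_vocab.py (which runs over the actual MS MARCO files and uses the project's tokenization); `min_frequency` and whitespace splitting are assumptions for illustration.

```python
from collections import Counter

def generate_vocab(lines, min_frequency=1):
    """Count whitespace tokens across a collection and keep those
    occurring at least `min_frequency` times, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [tok for tok, c in counts.most_common() if c >= min_frequency]

docs = ["the quick fox", "the lazy dog", "the fox"]
print(generate_vocab(docs, min_frequency=2))  # ['the', 'fox']
```

Restricting the vocabulary this way keeps the embedding matrix small and avoids loading GloVe rows for tokens that never appear in the collection.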
Worked like a charm!
Some additional documentation about the source of the embedding files would be helpful for getting it running. For instance, what are these two fastText files?
https://github.com/sebastian-hofstaetter/transformer-kernel-ranking/blob/master/matchmaker/configs/datasets/tr-msmarco-passage.yaml#L134