sebastian-hofstaetter / matchmaker

Training & evaluation library for text-based neural re-ranking and dense retrieval models built with PyTorch
https://neural-ir-explorer.ec.tuwien.ac.at/
Apache License 2.0
259 stars 30 forks

Generating pairwise distance training file #17

Open paul-chelarescu opened 2 years ago

paul-chelarescu commented 2 years ago

Hello, I am trying to get the pairwise distillation workflow to run, following the instructions here, but it seems that teacher-train-scorer.py already expects the config files to point at files in the pairwise format. I may have skipped a step, but my understanding is that once we have the canonical triples files from MSMARCO, we can use teacher-train-scorer.py to generate the pairwise distance files. Is that not the case?

I am running the following command with train_pairwise_distillation: False and with the recs yaml config files pointing at the MSMARCO triples.tsv: python matchmaker/distillation/teacher-train-scorer.py --run-name experiment1 --config-file config/train/data/recs-dataset.yaml config/train/recs.yaml config/train/models/bert_dot.yaml. It fails with KeyError: 'query_text'. From reading teacher-train-scorer.py, it looks like the triples file should be processed automatically, so I can't tell what is missing from the pipeline. The other files in the distillation directory don't seem to indicate that another pre-processing command is needed.
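One check that might help narrow this down is confirming that the file the config points at really has the three tab-separated columns (query, positive passage, negative passage) of the MSMARCO triples. The following is a minimal sketch, assuming a plain UTF-8 TSV; column_counts is a hypothetical helper and not part of matchmaker, and the path in the usage comment is a placeholder.

```python
from collections import Counter
from itertools import islice


def column_counts(path, max_lines=1000, sep="\t"):
    """Count how many sep-separated fields each of the first max_lines lines has.

    Hypothetical helper for inspecting a training file; not part of matchmaker.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in islice(f, max_lines):
            # Strip only the trailing newline so empty trailing fields still count.
            counts[len(line.rstrip("\n").split(sep))] += 1
    return counts


# Usage (placeholder path, replace with the file referenced in the config):
# print(column_counts("path/to/triples.tsv"))
```

If every sampled line reports three fields, the input is at least shaped like a triples file, which would suggest the KeyError comes from how the config selects the data loader rather than from the file itself.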

On the other hand, running the same command with train_pairwise_distillation: True predictably fails, because no pairwise train files have been generated yet. I've followed all of the instructions here and here, and gone through all of the documentation in the project, but I can't work out how to make the distillation workflow run properly. Is there anything I'm overlooking? Many thanks.