sebastian-hofstaetter / matchmaker

Training & evaluation library for text-based neural re-ranking and dense retrieval models built with PyTorch
https://neural-ir-explorer.ec.tuwien.ac.at/
Apache License 2.0

Use matchmaker as a library #6

Closed (cmacdonald closed this issue 3 years ago)

cmacdonald commented 4 years ago

Hi,

I'm interested in using matchmaker as a library. Some related questions/comments:

  1. Being able to pip install and import it would help. Could you add a setup.py and an __init__.py, so that I could use "pip install git+"?

  2. What is the minimum data I need to supply? The text of the queries, the text of the documents, plus labels? And what is the format of idf_embedder?

  3. Could the training loop be separated into a method that can be easily called with the above?
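
To make points 2 and 3 concrete, here is a rough sketch of the kind of call being asked for. It is not matchmaker's API: `LabeledPair`, `batches`, `train` and the `step` callback are hypothetical names invented for illustration, and `step` stands in for whatever model-specific forward/backward pass the library would run.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass(frozen=True)
class LabeledPair:
    """Hypothetical minimal training record: query text, document text, label."""
    query: str
    document: str
    label: float


def batches(items: Sequence[LabeledPair], size: int) -> Iterator[List[LabeledPair]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def train(
    examples: Sequence[LabeledPair],
    step: Callable[[List[LabeledPair]], float],
    epochs: int = 1,
    batch_size: int = 2,
) -> List[float]:
    """Call `step` on every batch and return the mean batch loss per epoch.

    `step` is an assumption: it represents the model update (for example a
    PyTorch optimizer step) and must return the loss for the batch it was given.
    """
    if not examples:
        raise ValueError("at least one training example is required")
    epoch_losses = []
    for _ in range(epochs):
        losses = [step(batch) for batch in batches(examples, batch_size)]
        epoch_losses.append(sum(losses) / len(losses))
    return epoch_losses


# Toy usage: the "loss" is just the batch size, so no model is needed.
data = [
    LabeledPair("what is bm25", "BM25 is a ranking function.", 1.0),
    LabeledPair("what is bm25", "Vienna is in Austria.", 0.0),
    LabeledPair("dense retrieval", "Dense retrieval encodes text as vectors.", 1.0),
]
toy_losses = train(data, step=lambda batch: float(len(batch)), epochs=2)
```

The point of the design is that the caller only supplies in-memory records and one callable; configuration files and a command line are not needed to start a run.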

Craig

sebastian-hofstaetter commented 4 years ago

Hi Craig,

  1. I'll look into that. I am currently doing a bit of cleanup (the current repo version is based on allennlp 0.9, and the new 1.0 has a couple of breaking changes); after that I can also add the __init__.py and setup.py.

  2. I would say that is currently one of the drawbacks: for the library to work, we need to split the training and validation data so that pre-processing can run faster in separate Python processes. In theory, that could also be done in the train-loop process to allow for the "method-call" format.

  3. Yes, I think I could do that. Currently train.py is so large because it accommodates a wide range of configurations that are probably not needed for most use cases. Could you elaborate a bit more on how you would want to use the library? For training, just for inference, or both? Thank you.

Best, Sebastian

cmacdonald commented 4 years ago

  1. Here is how it was done elsewhere: https://github.com/Georgetown-IR-Lab/cedr/pull/27/files

2 & 3: I'm on a crusade against the proliferation of command-line interfaces to deep neural toolkits. I'm trying to make everything work in a Python API where:

See example usages at https://github.com/cmacdonald/pyterrier_bert

I understand you might need (I)DF values? Perhaps a simple API could provide these?
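
For illustration only, this is roughly what such a helper could compute from raw document text. The function names and the tokenizer are hypothetical, the smoothed formula `ln((1 + N) / (1 + df)) + 1` is just one common IDF variant, and the format that matchmaker's idf_embedder actually expects is not specified in this thread.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def document_frequencies(documents: Iterable[str]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))  # set(): count each term once per document
    return df


def idf_table(documents: Iterable[str]) -> Dict[str, float]:
    """Map each term to a smoothed IDF value: ln((1 + N) / (1 + df)) + 1."""
    docs = list(documents)
    n_docs = len(docs)
    df = document_frequencies(docs)
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in df.items()
    }


corpus = ["neural ranking", "neural retrieval", "dense retrieval models"]
idf = idf_table(corpus)
```

Terms that occur in fewer documents receive larger values, so in this toy corpus `idf["ranking"]` is larger than `idf["neural"]`.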

Craig