https://github.com/huggingface/transformers has a training module now and we use those wrappers for the models. It should be a 2-output model, like https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
Should I use the BertForSequenceClassification model class? And a [SEP] token between the query and the passage in the input?
@pertschuk can you provide a BertConfig, required by BertForSequenceClassification, that would be compatible with nboost?
@klasocki the transformers tokenizer.encode function supports two text arguments and automatically adds the [SEP] token (add_special_tokens=True or something like that); the first argument should be the query, the second the passage
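For example (a rough sketch; the checkpoint name and texts are just placeholders):

```python
from transformers import BertTokenizer

# placeholder checkpoint; use whatever your model was trained from
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

query = "who invented the telephone"       # first sequence
passage = "Alexander Graham Bell was ..."  # second sequence

# encode() accepts a text pair and produces [CLS] query [SEP] passage [SEP]
input_ids = tokenizer.encode(query, passage, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(input_ids))
```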
@apohllo BertConfig can be anything so long as output size = 2 (binary classification)
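So something like this should work (a sketch, assuming a standard bert-base checkpoint):

```python
from transformers import BertConfig, BertForSequenceClassification

# num_labels=2 gives the classification head two outputs (relevant / not relevant)
config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=config)
```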
@pertschuk Thank you for your help, I finally managed to use my model!
I ran into one more problem with token_type_ids, which I fixed in #73
@klasocki Could you share your code? I'd like to try training my own model as well.
@petulla I used the code provided in this great tutorial series to train the model and it worked with nboost
OK, seems straightforward, just using transformers' BertForSequenceClassification. This notebook seems like the relevant code.
One question: Did you train on document/query pairs or paragraph/query? @klasocki
Document/query, since Elastic performed worse on paragraphs in my case
I think you mean elastic performed better, @klasocki?
Does your setup do re-ranking? Like: Elastic ranks the first 1000, then you re-rank that top 1000? I'm considering a setup like that; I wasn't sure if nboost works that way. I'm also confused about whether the entire document is fed into BERT, since BERT only takes the first 512 tokens.
No, I actually mean worse 😆 But that could just be my data. Truncating the docs to 512 tokens worked quite well, since for Wikipedia search the most important information is at the beginning anyway.
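The truncation itself can be left to the tokenizer (a sketch; truncation=True is the newer transformers API, older versions used truncation strategies instead):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

query = "history of the telephone"
document = "The telephone was invented ..."  # possibly thousands of tokens long

# keep only the first 512 tokens of the encoded pair; BERT can't take more
input_ids = tokenizer.encode(
    query, document,
    add_special_tokens=True,
    max_length=512,
    truncation=True,
)
```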
Yes, nboost works that way. The usual approach is to ask Elastic for e.g. 100 documents; nboost then re-ranks them (based solely on the model, not weighted against the ES scores) and returns e.g. the top 10 to you
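Conceptually the re-rank step looks roughly like this (not nboost's actual code, just a sketch; score_pair stands for whatever cross-encoder model you plug in):

```python
def rerank(query, candidates, score_pair, k=10):
    """Re-rank Elastic's candidate documents using only the model's scores."""
    scored = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return scored[:k]
```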
Connected to #45, #49, and #35. I am struggling to get nboost working with a custom model, and I am not sure where to start. What exactly needs to be the input and output of a model? What function is called?
I tried to use a model trained with the code from https://github.com/ThilinaRajapakse/simpletransformers#minimal-start-for-sentence-pair-classification (with regression), but no luck. I wasn't able to set the --model argument; it keeps telling me that PtBertRerankModelPlugin is not in MODULE_MAP. The model loads and nboost starts, but it raises exceptions on each query:
Models from simpletransformers take their input as in
`model.predict([[query, text]])`
Is that OK? Should it use .forward, or a different input? What should the output be: a single value between 0 and 1, or a tensor (and if so, with what dimensions)? Do you recommend a way to train such models (sentence-transformers, vanilla huggingface/transformers)?
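For context, the pair-regression setup from that README looks roughly like this (a sketch; API details may differ between simpletransformers versions):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# toy query/passage pairs with relevance scores in [0, 1]
train_df = pd.DataFrame(
    [
        ["what is nboost", "nboost is a proxy that re-ranks search results", 1.0],
        ["what is nboost", "the weather in vegas is hot in summer", 0.0],
    ],
    columns=["text_a", "text_b", "labels"],
)

model = ClassificationModel(
    "bert", "bert-base-uncased",
    num_labels=1,                 # single regression output
    args={"regression": True},
)
model.train_model(train_df)

# prediction takes [query, text] pairs, exactly as in the call above
predictions, raw_outputs = model.predict([["what is nboost", "nboost re-ranks results"]])
```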