koursaros-ai / nboost

NBoost is a scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different search platforms (e.g. Elasticsearch).
Apache License 2.0

How exactly can I use a custom model with nboost? #69

Closed: klasocki closed this issue 4 years ago

klasocki commented 4 years ago

Connected to #45, #49 and #35: I am struggling to get nboost working with a custom model, and I am not sure where to start. What exactly do the input and output of the model need to be? What function is called?

I tried to use a model trained with the code from https://github.com/ThilinaRajapakse/simpletransformers#minimal-start-for-sentence-pair-classification with regression, but had no luck. I wasn't able to set the --model argument; it keeps telling me that PtBertRerankModelPlugin is not in MODULE_MAP. The model loads and nboost starts, but it raises exceptions on each query:

  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/proxy.py", line 123, in proxy_through
    plugin.on_response(response, db_row)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 34, in on_response
    filter_results=response.request.filter_results
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 65, in rank
    score = logit[1]
IndexError: index 1 is out of bounds for axis 0 with size 1

Models from simpletransformers take their input as in model.predict([[query, text]]); is that OK? Should it use .forward, or a different input? What should the output be: a single value between 0 and 1, or a tensor (of what dimensions)? Do you recommend a particular way to train such models (sentence-transformers, vanilla huggingface/transformers)?

pertschuk commented 4 years ago

https://github.com/huggingface/transformers has a training module now, and we use those wrappers for the models. It should be a 2-output model like https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
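
For concreteness, here is a minimal sketch of that kind of two-output sentence-pair model, assuming a recent huggingface/transformers release; the model name, example texts, and label below are placeholders, not anything nboost ships with:

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=2,  # two logits; index 1 is read as the relevance score
    )

    # One training-style step on a (query, passage, label) example
    inputs = tokenizer("who wrote hamlet",
                       "Hamlet is a tragedy written by William Shakespeare.",
                       truncation=True, max_length=512, return_tensors="pt")
    labels = torch.tensor([1])  # 1 = relevant, 0 = irrelevant
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()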

klasocki commented 4 years ago

Should I use the BertForSequenceClassification model class? And a [SEP] token between the query and the passage in the input?

apohllo commented 4 years ago

@pertschuk can you provide a BertConfig, required by BertForSequenceClassification, that would be compatible with nboost?

pertschuk commented 4 years ago

@klasocki the transformers tokenizer.encode function supports two text arguments and automatically adds the SEP token (add_special_tokens=True or something similar); the first should be the query, the second the passage.

@apohllo the BertConfig can be anything, so long as the output size is 2 (binary classification)
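
A short sketch of that encoding and scoring end to end, assuming the standard transformers API (this is not nboost's actual plugin code; the model name and texts are placeholders). It also shows why the traceback above fails: a one-output regression model has no logit at index 1.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)
    model.eval()

    # Passing two texts makes the tokenizer build [CLS] query [SEP] passage [SEP]
    inputs = tokenizer("who wrote hamlet",                     # first arg: query
                       "Hamlet is a tragedy by Shakespeare.",  # second arg: passage
                       truncation=True, max_length=512, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits   # shape (1, 2)
    score = logits[0][1].item()           # index 1 = "relevant" logit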

klasocki commented 4 years ago

@pertschuk Thank you for your help, I finally managed to use my model!

I ran into one more problem with token_type_ids, which I fixed in #73
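
For context, a small illustration of the token_type_ids a pair encoding produces (a general sketch, not the actual change in #73):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    enc = tokenizer("what is nboost",
                    "NBoost re-ranks search results with a transformer.",
                    return_tensors="pt")
    # 0s over the query segment, 1s over the passage segment; a BERT reranker
    # needs these passed to the model along with input_ids and attention_mask.
    print(enc["token_type_ids"])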

petulla commented 4 years ago

@klasocki Could you share your code? I'd like to try training my own model as well.

klasocki commented 4 years ago

@petulla I used the code provided in this great tutorial series to train the model and it worked with nboost

petulla commented 4 years ago

OK, seems straightforward; it's just using transformers' BertForSequenceClassification. This notebook seems like the relevant code.

petulla commented 4 years ago

One question: did you train on document/query pairs or paragraph/query pairs? @klasocki

klasocki commented 4 years ago

Document/query, since Elastic performed worse on paragraphs in my case

petulla commented 4 years ago

I think you mean Elastic performed better, @klasocki?

Does your setup do re-ranking? Like: Elastic ranks the first 1000, then you re-rank that top 1000? I'm considering a setup like that; I wasn't sure if nboost works that way. I'm also confused about whether the entire document is fed into BERT, since BERT only takes the first 512 tokens.

klasocki commented 4 years ago

No, I actually mean worse 😆 But that could just be my data. Truncating the docs to 512 tokens worked quite well, since for Wikipedia search the most important information is at the beginning anyway.

Yes, nboost works that way. The usual approach is to ask Elastic for e.g. 100 documents; nboost then re-ranks them (based solely on the model score, without weighting in the ES score) and returns e.g. the top 10 to you.
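
A rough sketch of that retrieve-then-rerank flow (nboost itself does this transparently as a proxy; the Elasticsearch URL, index name, field name, and sizes below are placeholder assumptions):

    import requests
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2).eval()

    def rerank(query, index="wiki", field="text", fetch=100, return_k=10):
        # 1) Ask Elasticsearch for more candidates than you actually want
        resp = requests.get(f"http://localhost:9200/{index}/_search",
                            json={"size": fetch,
                                  "query": {"match": {field: query}}})
        hits = resp.json()["hits"]["hits"]

        # 2) Score each (query, document) pair with the model alone,
        #    truncating long documents to BERT's 512-token limit
        scored = []
        for hit in hits:
            inputs = tokenizer(query, hit["_source"][field],
                               truncation=True, max_length=512,
                               return_tensors="pt")
            with torch.no_grad():
                score = model(**inputs).logits[0][1].item()
            scored.append((score, hit))

        # 3) Return only the top return_k, ordered by model score (ES score ignored)
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [hit for _, hit in scored[:return_k]]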