koursaros-ai / nboost

NBoost is a scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different platforms (e.g. Elasticsearch).
Apache License 2.0
674 stars 69 forks

Do I need to re-index an existing ElasticSearch index? #66

Closed klasocki closed 4 years ago

klasocki commented 4 years ago

I have an existing large index inside my ElasticSearch (~million documents, some of them pretty long). I would like to use it with nboost, but avoid costly re-indexing and creating a csv file.

Is it possible, or do I need to use the nboost-index tool every time I want to work with new data?

MartinXPN commented 4 years ago

No, you don't need to reindex when using nboost. Nboost works as a proxy that sits between the client requesting the data and Elasticsearch.

When the request is sent:

So, the only change you'll need to do is to send the requests to nboost instead of sending them directly to elasticsearch.
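As a minimal sketch of that change, the snippet below rewrites an Elasticsearch search URL so that it targets the proxy instead. The helper name `to_nboost_url` is hypothetical, and the host and port (`localhost:8000`) are assumptions taken from the commands later in this thread; adjust them to wherever nboost is actually listening.

```python
from urllib.parse import urlsplit, urlunsplit


def to_nboost_url(es_url: str, proxy_netloc: str = "localhost:8000") -> str:
    """Point an Elasticsearch URL at the nboost proxy, keeping path and query.

    `proxy_netloc` is an assumed default; it is not read from any nboost config.
    """
    parts = urlsplit(es_url)
    # Only the host:port changes; the index, route and query string stay the same.
    return urlunsplit(parts._replace(netloc=proxy_netloc))


direct = "http://localhost:9200/wikipedia/_search?pretty&q=text:test&size=1"
proxied = to_nboost_url(direct)
print(proxied)
```

The index itself is untouched; only the address the client talks to is different.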

klasocki commented 4 years ago

In that case I think there is an issue with my proxy. Should I open a new issue? I have an index with Polish Wikipedia (I want to use it with the default tinybert model, which probably doesn't support Polish, just to check whether it's working at all), and when I query it directly:

curl "localhost:9200/wikipedia/_search?pretty&q=text:test&size=1"

it gives results as expected, however when I try to do it through nboost:

curl "localhost:8000/wikipedia/_search?pretty&q=text:test"

then all I get is an empty list of hits

{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 4, "successful": 4, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 6019, "relation": "eq" },
    "max_score": 10.853271,
    "hits": []
  },
  "nboost": { "scores": [] }
}

What could be happening? Is it possible that the model thinks none of the results are valid and thus returns none? It works just fine with the travel index provided with nboost. Here is the command I use to run nboost:

nboost \
  --uhost localhost \
  --uport 9200 \
  --query_path url.query.q \
  --topk_path url.query.size \
  --default_topk 10 \
  --choices_path body.hits.hits \
  --cvalues_path _source.passage \
  --search_route "/wikipedia/_search"

MartinXPN commented 4 years ago

Are your Wikipedia texts located at _source.passage? You are providing --cvalues_path _source.passage but your texts might be located in a different path.
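One way to check which path holds the text is to list the field names that appear under `_source` in a response obtained directly from Elasticsearch. The sketch below does this on a hand-written sample; the `sample_response` contents and the helper name `source_fields` are hypothetical and only mirror the `hits.hits[]._source` shape of a search response.

```python
def source_fields(response: dict) -> list:
    """Return the sorted field names found under `_source` across all hits."""
    fields = set()
    for hit in response.get("hits", {}).get("hits", []):
        fields.update(hit.get("_source", {}).keys())
    return sorted(fields)


# Hypothetical response body, shaped like an Elasticsearch search result.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"title": "Example", "text": "Some passage."}},
            {"_id": "2", "_source": {"title": "Other", "text": "Another passage."}},
        ]
    }
}

print(source_fields(sample_response))
```

If `passage` is not among the listed names, `--cvalues_path _source.passage` has nothing to match in that index.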

klasocki commented 4 years ago

No, I have two fields in the _source attribute of each hit: title and text. I changed the setting to the default, --cvalues_path _source.*, but it didn't help. Now I get the following error:

Traceback (most recent call last):
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/proxy.py", line 123, in proxy_through
    plugin.on_response(response, db_row)
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 39, in on_response
    reranked_choices = [response.choices[rank] for rank in ranks]
  File "/net/scratch/people/plgklasocki/transformers-env/lib/python3.6/site-packages/nboost/plugins/rerank/base.py", line 39, in <listcomp>
    reranked_choices = [response.choices[rank] for rank in ranks]
IndexError: list index out of range

MartinXPN commented 4 years ago

I think nboost should be configured to re-rank based on one text field. I'd suggest changing the default --cvalues_path _source.* to --cvalues_path _source.text

klasocki commented 4 years ago

Yes, that worked, thank you! So does that mean nboost doesn't support multi-match queries?

MartinXPN commented 4 years ago

yeah it supports all the queries elasticsearch does, cause it forwards the queries to elasticsearch, but the re-ranking can only be done on one text field.
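The following is a hypothetical illustration, not nboost's actual code, of why a wildcard path might lead to the `IndexError` shown earlier in this thread. It assumes that one value is expected per hit and that ranks are computed over the extracted values; under that assumption, a path matching two fields per hit produces more ranks than there are hits. The helpers `extract`, `collect` and `rerank` are invented for this sketch.

```python
def extract(hit: dict, path: str) -> list:
    """Return every value in `hit` matching a dotted path; '*' matches all keys."""
    nodes = [hit]
    for key in path.split("."):
        matched = []
        for node in nodes:
            if not isinstance(node, dict):
                continue
            if key == "*":
                matched.extend(node.values())
            elif key in node:
                matched.append(node[key])
        nodes = matched
    return nodes


def collect(hits: list, path: str) -> list:
    """Flatten the values extracted from every hit into one list."""
    values = []
    for hit in hits:
        values.extend(extract(hit, path))
    return values


def rerank(choices: list, ranks: list) -> list:
    """Reorder choices by rank; raises IndexError if a rank exceeds the choices."""
    return [choices[rank] for rank in ranks]


# Hypothetical hits with the two fields mentioned in this thread.
hits = [
    {"_source": {"title": "A", "text": "first passage"}},
    {"_source": {"title": "B", "text": "second passage"}},
]

single = collect(hits, "_source.text")  # one value per hit
wildcard = collect(hits, "_source.*")   # two values per hit

print(len(hits), len(single), len(wildcard))

try:
    rerank(hits, list(range(len(wildcard))))
except IndexError as err:
    print("wildcard path:", err)
```

With `_source.text` the number of values equals the number of hits, so every rank points at an existing choice.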

klasocki commented 4 years ago

Thank you very much, I'm closing :)