opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0
57 stars 58 forks

Extending Neural Search pipeline to Named entity recognition and other metadata extracting models #134

Open navneet1v opened 1 year ago

navneet1v commented 1 year ago

Copying the customer request from the forum post: https://forum.opensearch.org/t/extending-neural-search-pipeline-to-named-entity-recognition-and-other-metadata-extracting-models/13078

I have a use case that involves a named entity recognition (NER) model for documents and queries at both indexing and query time. Documents will be filtered based on whether their extracted entities match the entities extracted from the query. The pipeline will work like the existing neural search pipeline, with one difference: in this use case, the queries and documents will be passed through a NER model and enriched with extra metadata such as entities, instead of the vectors provided by an embedding model.

It would help if the neural-search pipeline could be extended to include model(s) that enable named entity extraction, embeddings, image segments (finding image components for image search), etc., so that the query/document gets enough metadata extracted by the various models listed in my neural search pipeline before matching.
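To make the request concrete, here is a minimal Python sketch of the entity-based filtering described above. Everything in it is an assumption for illustration: `extract_entities` is a toy gazetteer lookup standing in for a real NER model, and the `entities` field name and helper functions are hypothetical, not part of the neural-search plugin.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a NER model: a tiny fixed gazetteer.
KNOWN_ENTITIES = {"opensearch", "amazon", "seattle"}


def extract_entities(text: str) -> Set[str]:
    """Return the known entities found in the text (toy replacement for NER)."""
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return tokens & KNOWN_ENTITIES


def index_document(doc: Dict[str, str]) -> Dict[str, object]:
    """Index time: copy the document and attach extracted entities as metadata."""
    enriched: Dict[str, object] = dict(doc)
    enriched["entities"] = sorted(extract_entities(doc["text"]))
    return enriched


def filter_by_entities(query: str, docs: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Query time: keep documents sharing at least one entity with the query."""
    query_entities = extract_entities(query)
    return [d for d in docs if query_entities & set(d["entities"])]


if __name__ == "__main__":
    corpus = [
        index_document({"id": "1", "text": "OpenSearch is hosted in Seattle."}),
        index_document({"id": "2", "text": "A recipe for sourdough bread."}),
    ]
    for hit in filter_by_entities("Where is Seattle?", corpus):
        print(hit["id"], hit["entities"])
```

In a real pipeline the same extractor would run once per document at ingest and once per query at search time, mirroring how the embedding model is used today.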

Please add a +1 if you are looking for this feature. If possible, add a comment explaining your use case.

navneet1v commented 1 year ago

@ylwu-amzn does the ML plugin API support named entity recognition models?

@mshyani how do you think this would impact indexing and queries?

MilindShyani commented 1 year ago

I am not sure what's the best way to implement this. Perhaps one method would be to use a cross encoder model.

In this architecture, you first retrieve the top k documents d_i for a query q and then pass each pair (q, d_i), where i ranges from 1 to k, to the model. This model, which can be an NER model, can be used to rerank the passages. I don't think this is straightforward to implement with the current plugins; it is also computationally expensive (since the transformer makes k passes).

Note that there is another way, where a model reads the queries, finds the named entities, and looks for those entities in the document corpus. But this is (almost) exactly what a neural retriever does when it creates a vector for the query and looks for nearest neighbors!

There could be other ways, but I can't think of any off the top of my head.
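The retrieve-then-rerank flow can be sketched roughly as follows. This is only an illustration under stated assumptions: `rerank` and `token_overlap_score` are hypothetical names, and the token-overlap scorer is a cheap stand-in for the cross encoder (or NER-based) model, which in practice would be one transformer forward pass per (q, d_i) pair.

```python
from typing import Callable, List, Tuple


def token_overlap_score(query: str, doc: str) -> float:
    """Toy pair scorer: fraction of query tokens that also appear in the document.

    Stand-in for a cross encoder; a real model would score the pair jointly.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair and sort by descending score.

    The scorer is called once per candidate, so the cost grows linearly with k.
    """
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Pretend these are the top-k hits returned by a first-stage retriever.
    top_k = ["the cat sat", "dogs bark loudly", "a cat and a dog"]
    for doc, score in rerank("cat dog", top_k, token_overlap_score):
        print(f"{score:.2f}  {doc}")
```

The k scorer calls per query are where the computational cost mentioned above comes from.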

navneet1v commented 1 year ago

@MilindShyani thanks for the update.

Let me do some research on how NER models work and see if I can come up with a proposed solution that can be added as a feature in the Neural Search plugin.

ylwu-amzn commented 1 year ago

ml-commons doesn't support named entity recognition models right now.

prasadnu commented 1 year ago

To be a bit more clear, I was thinking that the neural search pipeline could be extended so that it can be used not only for retrieving vectors from an embedding model, but also for retrieving any other metadata, such as entities (for both docs and queries), from a NER model.

Currently, before creating a neural search pipeline, we must upload and load an ML model that provides embeddings (refer to the screenshot). This is limited to models that provide embeddings; if it could be extended so that any metadata model, such as NER, can be uploaded and used to create a neural search pipeline, it would be generic.

image

image

CodeAKrome commented 1 year ago

I'm doing NER by putting my OpenSearch data stream through a container that injects the entities during forwarding: [data src] -> [injector] -> [opensearch/_bulk]. Would this be of any use to anyone, do you think? I looked at the PRs and poked around a bit and didn't see anything but this thread. I'm pulling RSS feeds. My goal is to get this working in Kubernetes so I can scale it.
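For readers wondering what such an injector step might look like, here is a rough Python sketch. It is not the container described above: the capitalized-word extractor is a toy stand-in for a real NER model, and the `entities` field, the `rss-items` index name, and the function names are all assumptions. It only builds the newline-delimited body that the `_bulk` endpoint expects (an action line followed by a source line per document) and does not send anything.

```python
import json
from typing import Callable, Dict, Iterable, List


def toy_extract(text: str) -> List[str]:
    """Toy stand-in for NER: treat capitalized words as entities."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def inject_entities(doc: Dict[str, object], extract: Callable[[str], List[str]]) -> Dict[str, object]:
    """Return a copy of the document with a hypothetical 'entities' field added."""
    enriched = dict(doc)
    enriched["entities"] = extract(str(doc.get("text", "")))
    return enriched


def to_bulk_body(docs: Iterable[Dict[str, object]], index_name: str) -> str:
    """Build an NDJSON _bulk body: one action line, then one source line, per doc."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must be terminated by a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    feed_items = [
        {"text": "Amazon opens office in Seattle."},
        {"text": "new release announced today"},
    ]
    enriched = [inject_entities(item, toy_extract) for item in feed_items]
    print(to_bulk_body(enriched, "rss-items"), end="")
```

Keeping the injector as a pure transformation between the data source and `_bulk` means it can be scaled horizontally without touching the cluster configuration.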

rs-amundaware commented 10 months ago

https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example ES provides this solution. Do we have, or can we have, this feature in OpenSearch as well? Please let me know if it already exists.

navneet1v commented 10 months ago

@rs-amundaware I think there was an issue in ML-Commons that was tracking adding new types of models via the ML-Commons plugin: https://github.com/opensearch-project/ml-commons/issues/1164

rs-amundaware commented 10 months ago

@navneet1v Thanks, yes. Waiting for that feature eagerly.