rejasupotaro / amazon-product-search


Introduce Eland #10

Closed rejasupotaro closed 1 year ago

rejasupotaro commented 1 year ago

Summary

This PR enables the import of encoders into Elasticsearch via eland.

Eland DataFrame

Eland lets us analyze indexed docs through pandas-compatible APIs. It appears that eland translates pandas API calls into Elasticsearch API calls, as implemented here.

Since eland reads its data from Elasticsearch, any field excluded from _source is not retrieved. In this project, the field product_vector is defined as an excluded field (mappings._source.excludes: "product_vector"), so its values show up as NaN, as shown below.

>>> import eland as ed
>>> df = ed.DataFrame("http://localhost:9200", es_index_pattern="products_jp")
>>> df
                      product_brand  ... product_vector
B07R91TVJB                           ...            NaN
B00FW60P84                           ...            NaN
B071KNF11Z                   SHD-PB  ...            NaN
B07HR6BC88                           ...            NaN
B089Q21H6Z                    Meize  ...            NaN
...                             ...  ...            ...
B078GHPC9T  ティーケーカンパニー (TK.Company)  ...            NaN
B07KYY329D        TRAVELIST(トラベリスト)  ...            NaN
B07T15VXM2                           ...            NaN
B07YD6R7M3                   KIZUNA  ...            NaN
B083JF93J5                  Tbmodel  ...            NaN

[100 rows x 10 columns]
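To make the NaN column less surprising, here is a minimal sketch of what the exclusion does to a stored document. Only the `_source.excludes` entry comes from this project; the rest of the mapping (including `dims: 768`) is an illustrative assumption, and `visible_source` is a hypothetical helper that mimics the exclusion for top-level fields only.

```python
import fnmatch

# Illustrative mapping. Only "_source.excludes" is taken from this project;
# the field types and the dims value are assumptions.
MAPPINGS = {
    "_source": {"excludes": ["product_vector"]},
    "properties": {
        "product_brand": {"type": "keyword"},
        "product_vector": {"type": "dense_vector", "dims": 768},
    },
}


def visible_source(doc: dict, mappings: dict) -> dict:
    """Return the top-level fields that survive the _source.excludes patterns."""
    patterns = mappings.get("_source", {}).get("excludes", [])
    return {
        name: value
        for name, value in doc.items()
        if not any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    }


doc = {"product_brand": "KIZUNA", "product_vector": [0.1, 0.2]}
print(visible_source(doc, MAPPINGS))  # prints {'product_brand': 'KIZUNA'}
```

The vector is still indexed and usable for kNN search. It is just absent from the stored _source, so a client that rebuilds rows from _source has nothing to put in that column.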

Machine Learning with Eland

Another notable feature of eland is that it provides APIs for importing and executing machine learning models. Combined with recently added Elasticsearch features such as query-time encoding, this enables more flexible vector search within Elasticsearch.

Importing Encoders

I have added a task for importing models via eland.

$ poetry run inv es.import-model

This is equivalent to the following command.

$ eland_import_hub_model \
  --url http://localhost:9200 \
  --hub-model-id sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
  --task-type text_embedding \
  --start
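For readers who prefer the Python API over the CLI, the same import can be sketched roughly as follows. This is not the exact code of the task added in this PR. The `import_model(...)` call mirrors the one in `tasks/es_tasks.py`, but the surrounding calls (`TransformerModel`, `save`, `PyTorchModel`, `start`) are written from eland's 8.x API as I understand it and may differ between eland versions. `to_es_model_id` and `import_hub_model` are hypothetical helper names.

```python
def to_es_model_id(hub_model_id: str) -> str:
    """Reproduce the ID mapping visible in this PR: "/" in the Hub ID becomes "__".

    eland's own TransformerModel.elasticsearch_model_id() is the authoritative
    implementation; this helper only illustrates the naming.
    """
    return hub_model_id.replace("/", "__")


def import_hub_model(es_url: str, hub_model_id: str, task_type: str = "text_embedding") -> str:
    """Download a Hugging Face model, upload it to Elasticsearch, and start it."""
    # Imported lazily so the sketch can be loaded without eland/torch installed.
    import tempfile

    from eland.ml.pytorch import PyTorchModel
    from eland.ml.pytorch.transformers import TransformerModel
    from elasticsearch import Elasticsearch

    es = Elasticsearch(es_url)
    tm = TransformerModel(model_id=hub_model_id, task_type=task_type)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Trace the model and write the TorchScript file, config, and vocabulary.
        model_path, config, vocab_path = tm.save(tmp_dir)
        ptm = PyTorchModel(es, tm.elasticsearch_model_id())
        ptm.import_model(
            model_path=model_path, config_path=None, vocab_path=vocab_path, config=config
        )
    ptm.start()  # counterpart of the CLI's --start flag
    return ptm.model_id


print(to_es_model_id("sonoisa/sentence-bert-base-ja-mean-tokens-v2"))
# prints sonoisa__sentence-bert-base-ja-mean-tokens-v2
```

The printed ID is the one that later appears as `model_id` in the search request.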
AuthorizationException: current license is non-compliant for [ml]

I got this error when I tried to import a model. It turns out that importing PyTorch models is a platinum-licensed feature.

```shell
$ poetry run inv es.import-model
Traceback (most recent call last):
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/bin/inv", line 8, in <module>
    sys.exit(program.run())
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/invoke/program.py", line 384, in run
    self.execute()
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/invoke/program.py", line 569, in execute
    executor.execute(*self.tasks)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/invoke/executor.py", line 129, in execute
    result = call.task(*args, **call.kwargs)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/invoke/tasks.py", line 127, in __call__
    result = self.body(*args, **kwargs)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/tasks/es_tasks.py", line 50, in import_model
    ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/eland/ml/pytorch/_pytorch_model.py", line 122, in import_model
    self.put_config(path=config_path, config=config)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/eland/ml/pytorch/_pytorch_model.py", line 78, in put_config
    self._client.ml.put_trained_model(model_id=self.model_id, **config_map)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/elasticsearch/_sync/client/ml.py", line 3301, in put_trained_model
    return self.perform_request(  # type: ignore[return-value]
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 389, in perform_request
    return self._client.perform_request(
  File "/Users/kentaro-takiguchi/projects/amazon-product-search/.venv/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.AuthorizationException: AuthorizationException(403, 'security_exception', 'current license is non-compliant for [ml]')
```

I started the trial by calling the following API.

```
$ curl -X POST "http://localhost:9200/_license/start_trial?acknowledge=true"
```

The Platinum subscription costs $125 per month ([Official Elasticsearch Pricing: Elastic Cloud, Managed Elasticsearch | Elastic](https://www.elastic.co/pricing/)).

Encoding Queries at Query Time

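As of Elasticsearch 8.7, encoders can be executed at query time: the knn section of a search request accepts a query_vector_builder that runs an imported model on the query text, so the client no longer has to compute the query vector itself.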

Index docs:

$ poetry run inv es.index \
  --index-name=products_jp \
  --locale=jp \
  --dest-host=http://localhost:9200 \
  --extract-keywords \
  --encode-text \
  --nrows=100

Retrieve docs using the imported model:

GET products_jp/_search
{
  "query": {
    "match": {
      "product_title": {
        "query": "東京"
      }
    }
  },
  "knn": [
    {
      "field": "product_vector",
      "k": 10,
      "num_candidates": 100,
      "query_vector_builder": {
        "text_embedding": {
          "model_id": "sonoisa__sentence-bert-base-ja-mean-tokens-v2",
          "model_text": "東京"
        }
      }
    }
  ]
}
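The same request can be assembled from Python. The sketch below only builds the request body, so it runs without a cluster. `build_hybrid_search` is a hypothetical helper, and the commented-out client call assumes an `elasticsearch` Python client recent enough to accept `knn` in `search()`.

```python
import json


def build_hybrid_search(
    text: str,
    model_id: str,
    vector_field: str = "product_vector",
    k: int = 10,
    num_candidates: int = 100,
) -> dict:
    """Build a request that combines a lexical match with a kNN clause.

    The kNN clause lets Elasticsearch embed `text` with the imported model
    (query_vector_builder) instead of receiving a precomputed vector.
    """
    return {
        "query": {"match": {"product_title": {"query": text}}},
        "knn": [
            {
                "field": vector_field,
                "k": k,
                "num_candidates": num_candidates,
                "query_vector_builder": {
                    "text_embedding": {"model_id": model_id, "model_text": text}
                },
            }
        ],
    }


body = build_hybrid_search("東京", "sonoisa__sentence-bert-base-ja-mean-tokens-v2")
print(json.dumps(body, ensure_ascii=False, indent=2))

# With a running cluster (assumption: elasticsearch-py 8.x client):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="products_jp", **body)["hits"]["hits"]
```

Keeping the builder separate from the client call makes it easy to unit-test the request shape and to swap the model ID or `k` per experiment.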