openfoodfacts / search-a-licious

🍊🔎 A pluggable search service for large collections of objects (like Open Food Facts)
https://search.openfoodfacts.org
GNU Affero General Public License v3.0

Initial service commit #1

Closed simonj2 closed 2 years ago

simonj2 commented 2 years ago

What

Initial commit of the search service.

Sorry for the large commit, but this allowed me to scope out the proposal better.

Contains:

stephanegigandet commented 2 years ago

That looks awesome, thank you very much @simonj2 !

The CSV export is very partial, so at some point it will be better to switch to the MongoDB export.

There are a few things that we'll need to think about at some point. In particular, taxonomized fields like the categories are not stored with searchable textual names. The list of categories is in categories_tags, and it's an array with entries like "en:coffee". There is a field "categories" which can be in any language and should not be used for search.

One solution could be to include values for one language (e.g. for a French product, if there's categories_tags: "en:coffee", we also include categories: "café"). Then search would work for one language (not so good for products sold in countries with multiple languages, like Belgium or Canada).
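A minimal sketch of the one-language approach described above, assuming the taxonomy is available as a plain dictionary and that the product's main language is stored in a `lang` field. `CATEGORY_NAMES` and `add_localized_categories` are hypothetical names used for illustration only; they are not part of the service.

```python
from typing import Dict

# Hypothetical, hand-written taxonomy excerpt: canonical tag -> {language: name}.
CATEGORY_NAMES: Dict[str, Dict[str, str]] = {
    "en:coffee": {"en": "coffee", "fr": "café"},
    "en:beverages": {"en": "beverages", "fr": "boissons"},
}


def add_localized_categories(product: dict, taxonomy: Dict[str, Dict[str, str]] = CATEGORY_NAMES) -> dict:
    """Return a copy of the product with a searchable `categories` text field.

    Names are taken from the taxonomy for the product's main language only
    (assumed to be in `lang`, defaulting to "en"). Tags without a known
    translation are skipped.
    """
    lang = product.get("lang", "en")
    names = []
    for tag in product.get("categories_tags", []):
        translations = taxonomy.get(tag, {})
        if lang in translations:
            names.append(translations[lang])
    enriched = dict(product)  # shallow copy, the input is left untouched
    enriched["categories"] = ", ".join(names)
    return enriched


french_product = {"lang": "fr", "categories_tags": ["en:coffee"]}
print(add_localized_categories(french_product)["categories"])
```

As noted above, this only makes search work in a single language per product, so it would not cover multilingual markets.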

For language-specific fields like product_name, we also have entries like product_name_en, product_name_fr etc. (product_name contains a copy of the value for the main language of the product).

simonj2 commented 2 years ago

Thanks for the feedback @stephanegigandet ! I've incorporated the comments you had about the code.

The CSV export is very partial, so at some point it will be better to switch to the MongoDB export.

Makes sense - will look at this in the future.

There are a few things that we'll need to think about at some point. In particular, taxonomized fields like the categories are not stored with searchable textual names. The list of categories is in categories_tags, and it's an array with entries like "en:coffee". There is a field "categories" which can be in any language and should not be used for search.

One solution could be to include values for one language (e.g. for a French product, if there's categories_tags: "en:coffee", we also include categories: "café"). Then search would work for one language (not so good for products sold in countries with multiple languages, like Belgium or Canada).

For language-specific fields like product_name, we also have entries like product_name_en, product_name_fr etc. (product_name contains a copy of the value for the main language of the product).

Interesting! I think this comes back to the schema discussion. Fields aren't discoverable, documented, or consistent (i.e. sometimes the language is embedded in the value, as in en:coffee, and sometimes it's in the field name, as in product_name_fr).
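To make the inconsistency concrete, here is a rough sketch of what a consumer has to do today to recover the language from each convention. The helper names are hypothetical, and the two-letter pattern is an assumption based only on the examples in this thread.

```python
import re
from typing import Optional, Tuple

# Assumption: language codes are two lowercase letters, as in "en:coffee"
# and "product_name_fr". Real data may not always follow this.
_TAG_VALUE = re.compile(r"^([a-z]{2}):(.+)$")
_FIELD_SUFFIX = re.compile(r"^(.+)_([a-z]{2})$")


def split_tag_value(value: str) -> Tuple[Optional[str], str]:
    """Split a tag such as "en:coffee" into (language, name)."""
    match = _TAG_VALUE.match(value)
    if match:
        return match.group(1), match.group(2)
    return None, value


def split_field_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a field such as "product_name_fr" into (base field, language)."""
    match = _FIELD_SUFFIX.match(name)
    if match:
        return match.group(1), match.group(2)
    return name, None


print(split_tag_value("en:coffee"))
print(split_field_name("product_name_fr"))
print(split_field_name("product_name"))
```

A suffix heuristic like this would also misread any field that ends in two letters for another reason (a made-up `size_cm` would be treated as language "cm"), so it has to be maintained by hand as fields are added.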

A unified schema for data representation could solve this, i.e. every field could be stored as a map of language code -> value:

  product_name: {
    en: "Quarter Pounder",
    fr: "Royale with Cheese"
  }

This could then be returned to the clients directly, and a convenience API could be offered to return the correct value if a language parameter is provided. The same logic could apply for tags (filtering if a language parameter is provided).
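As a sketch of what such a convenience API might look like, assuming every language-dependent field (including tags) is stored as a language -> value map and every other field is a plain value. `localize` is a hypothetical helper, not an existing function in the service.

```python
from typing import Any, Dict, Optional


def localize(document: Dict[str, Any], lang: Optional[str] = None) -> Dict[str, Any]:
    """Resolve language maps to a single value when `lang` is given.

    Assumption: any dict-valued field is a language -> value map.
    Without `lang`, the document is returned unchanged so clients get
    every language. Fields with no value for `lang` are omitted.
    """
    if lang is None:
        return document
    resolved: Dict[str, Any] = {}
    for field, value in document.items():
        if isinstance(value, dict):
            if lang in value:
                resolved[field] = value[lang]
        else:
            resolved[field] = value
    return resolved


document = {
    "code": "123",  # made-up identifier
    "product_name": {"en": "Quarter Pounder", "fr": "Royale with Cheese"},
    "categories_tags": {"en": ["coffee"], "fr": ["café"]},
}
print(localize(document, "fr"))
```

Whether a missing language should be omitted, or fall back to the product's main language, is a design decision this sketch leaves open.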

I could write logic to do things like:

But I think that's probably fragile: new fields will break this, we don't make our API consumers' lives any easier by consolidating fields, etc.

Perhaps let's see how the schema work lands, then it could be incorporated here quite easily?