Open raphael0202 opened 2 months ago
What I'm going to try (following the discussion on Slack):
@raphael0202 If that's OK with you, could you please assign the issue to me?
Yes, that's a good plan to start :+1: I've assigned you to this issue
Hey, is it a requirement that the model needs to run locally (not on a server)? It seems that LLMs are good at detecting languages.
Tried to use clustering to fix mislabeled data.
I took the languages for which there are at least 100 texts (37 languages), then took 100 texts per language and used them as a training dataset (the plan was to then get predictions for the entire dataset).
The texts were converted to embeddings with fastText (the get_sentence_vector method), and the dimensionality was reduced from 256 to 66 with PCA to preserve 95% of the variance. I tried two methods: Gaussian mixture and HDBSCAN. The Gaussian mixture divides the data into only 3 clusters, and HDBSCAN classifies all new data as noise. The picture below shows the result of HDBSCAN clustering on the training data. The clusters are difficult to separate.
Either clustering is not suitable for this task, or I am doing something wrong.
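For reference, the pipeline looks roughly like this (a sketch only: the fastText model path, the clustering parameters and the sample handling are assumptions, not the exact code used):

```python
# Sketch: fastText sentence embeddings -> PCA keeping 95% of the variance
# -> Gaussian mixture / HDBSCAN clustering.
import fasttext
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3; the standalone hdbscan package also works

ft_model = fasttext.load_model("lid.176.bin")  # placeholder: any fastText model exposing get_sentence_vector

def cluster_texts(texts: list[str], n_languages: int):
    # get_sentence_vector does not accept newline characters, so strip them first
    X = np.array([ft_model.get_sentence_vector(t.replace("\n", " ")) for t in texts])
    X = PCA(n_components=0.95).fit_transform(X)  # keep enough components for 95% variance
    gmm_labels = GaussianMixture(n_components=n_languages, random_state=0).fit_predict(X)
    hdbscan_labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
    return gmm_labels, hdbscan_labels
```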
Now I will try another language identification model, lingua (https://github.com/pemistahl/lingua-py), to compare the predictions and confidences of the two models. Then I'll take the data on which the models' predictions coincide and both models are confident, and fine-tune one of them on this data.
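A rough sketch of that comparison (assuming the pretrained lid.176.bin fastText model and the lingua-py >= 1.3 API; the confidence threshold is arbitrary):

```python
# Sketch: keep only texts for which fastText and lingua agree on the language
# and both are confident. Model path and threshold are placeholders.
import fasttext
from lingua import LanguageDetectorBuilder

ft_model = fasttext.load_model("lid.176.bin")
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

def agree_with_high_confidence(text: str, threshold: float = 0.9) -> bool:
    labels, probs = ft_model.predict(text.replace("\n", " "))
    ft_lang, ft_conf = labels[0].removeprefix("__label__"), float(probs[0])

    # lingua returns confidence values sorted from highest to lowest
    best = lingua_detector.compute_language_confidence_values(text)[0]
    lingua_lang, lingua_conf = best.language.iso_code_639_1.name.lower(), best.value

    return ft_lang == lingua_lang and ft_conf >= threshold and lingua_conf >= threshold
```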
Here's a really nice article summarizing different approaches for language detection, from statistical to deep learning https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad
Would be great to have a validation dataset to estimate the performance of any solution. This dataset can be manually annotated using https://languagetool.org/
How I got the distribution of text languages:
1. selected `ingredients_text_{LANG}` field names from MongoDB:
```sh
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var cursor = db.mycollection.aggregate([
    { "$project": {
        "fields": { "$objectToArray": "$$ROOT" }
    }},
    { "$unwind": "$fields" },
    { "$match": { "fields.k": /^ingredients_text_/ }},
    { "$group": {
        "_id": null,
        "all_fields": { "$addToSet": "$fields.k" }
    }},
    { "$limit": 20 }
]);
if (cursor.hasNext()) {
    printjson(cursor.next().all_fields);
}' > field_names.json
```
2. then selected the field values:
```sh
FIELDS=$(jq -r '.[]' field_names.json | paste -sd "," -)
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var fields = "'$FIELDS'".split(",");
var projection = {};
fields.forEach(function(field) { projection[field] = 1; });
db.mycollection.find({}, projection).forEach(function(doc) {
    var cleanedDoc = {};
    fields.forEach(function(field) {
        if (doc[field] && doc[field] !== "") { cleanedDoc[field] = doc[field]; }
    });
    if (Object.keys(cleanedDoc).length > 0) { printjson(cleanedDoc); }
});' > filtered_extracted_values.json
```
(but after that there are still some extra fields left, e.g. `ingredients_text_with_allergens`)
3. then I made a dictionary in which the text is the key and the language is the value:

```python
import os
import ijson

ingredients_text_lang_dct = dict()
with open(os.path.join(data_dir, 'filtered_extracted_values.json'), 'r') as data_file:
    for dct in ijson.items(data_file, 'item'):
        for k, v in dct.items():
            if k == 'ingredients_text_with_allergens':
                continue
            lang = k[k.rfind('_') + 1:]
            # if the field is `ingredients_text_{LANG}_imported`
            if lang == 'imported':
                start = k[:k.rfind('_')].rfind('_') + 1
                end = k.rfind('_')
                lang = k[start:end]
            ingredients_text_lang_dct.update({v: lang})
```
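The language distribution shown later in the thread can then be obtained by counting the values of this dictionary, for example:

```python
# Count how many texts were collected per language
# (ingredients_text_lang_dct maps text -> language code, as built above)
from collections import Counter

lang_counts = Counter(ingredients_text_lang_dct.values())
for lang, count in lang_counts.most_common():
    print(lang, count)
```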
@raphael0202
Would be great to have a validation dataset to estimate the performance of any solution. This dataset can be manually annotated using https://languagetool.org/
How many samples should it contain? Should I select an equal number of samples for each language or just random? @jeremyarancio
Roughly 30 labels per language to start with, I would say. It's just to have an idea of the performance.
Here is the number of texts for each language:

```
en 422020  fr 299681  de 89880  es 46255  it 31801  nl 19983  pl 8401  pt 8119  sv 6128  bg 4453
ro 3771  fi 3726  ru 3610  nb 3591  cs 3500  th 3157  da 2021  hr 2015  hu 1962  ar 1104
el 943  ja 912  ca 824  sr 735  sl 727  sk 606  tr 506  lt 453  zh 436  et 370
lv 333  xx 318  no 315  uk 274  id 262  he 209  vi 121  is 113  la 89  in 72
ko 71  sq 70  iw 59  ka 54  ms 52  bs 37  fa 35  bn 33  gl 32  kk 25
mk 23  nn 18  hi 18  aa 17  uz 17  so 15  af 12  eu 11  az 8  be 7
cy 7  hy 7  tt 6  ku 5  km 4  te 4  ky 4  ur 4  mg 3  ty 3
ta 3  tg 3  my 3  tl 3  mo 2  sc 2  ir 2  ne 2  tk 2  am 2
mn 2  co 2  se 2  si 2  fj 1  ch 1  ug 1  yi 1  to 1  fo 1
mt 1  ht 1  ak 1  jp 1  oc 1  lb 1  mi 1  as 1  yo 1  ga 1
gd 1  ba 1  zu 1  mr 1
```
Would it be possible to share the link to this original data set? I am curious to have a look at it as well. Thanks!
I used the MongoDB dump. I described above how I retrieved the data from it. However, there might be an error in my script because some languages have fewer texts than expected (e.g. I got 912 samples of Japanese texts, but on https://jp-en.openfoodfacts.org/ there are around 16,000).
Please keep me posted if you're planning to work on this task, as I'm actively working on it. You can find me on OFF slack (Yulia Zhilyaeva).
If this can help, there's now a Parquet dump on Hugging Face, which is the JSONL dump processed and cleaned of irrelevant features:
https://huggingface.co/datasets/openfoodfacts/product-database
Tried to retrieve data from the huggingface dataset, but I still get ~900 samples of Japanese texts, and ~996,000 texts in total. Am I doing something wrong? Or is it because the hf dataset currently stores text only in the original language? My code:
```python
import os

import pandas as pd
from datasets import load_dataset
from tqdm import tqdm

other_lang_columns = [
    'ingredients_text_fr',
    'ingredients_text_en',
    ...
]
dataset_file = os.path.join(data_dir, 'data_from_hf.csv')

for start, stop in tqdm(zip(range(0, 91, 10), range(10, 101, 10))):
    # read 10% of the dataset
    hf_dataset = load_dataset('openfoodfacts/product-database', split=f'main[{start}%:{stop}%]')

    # retrieve ingredients_text and lang
    ingredients_texts = hf_dataset['ingredients_text']
    langs = hf_dataset['lang']
    df = pd.DataFrame({'ingredients_text': ingredients_texts, 'lang': langs})
    df.dropna(inplace=True)

    # retrieve ingredients_text_{LANG}
    for other_lang_col in other_lang_columns:
        lang = other_lang_col[-2:]
        other_lang_texts = hf_dataset[other_lang_col]
        other_lang_texts = [text for text in other_lang_texts if text is not None and len(text) > 0]
        new_rows = pd.DataFrame({'ingredients_text': other_lang_texts, 'lang': [lang] * len(other_lang_texts)})
        df = pd.concat((df, new_rows), ignore_index=True)

    # save
    df.to_csv(dataset_file, mode='a', header=start == 0, index=False)
```
The Parquet contains the same information as the JSONL file, so it's not surprising.
You also have the text in all languages as `ingredients_text` and `ingredients_text_{lang}`.
I see. I mean I don't understand why there are 16,000 products on https://jp-en.openfoodfacts.org/ while I have only 900 @jeremyarancio
Oh, it seems that just not all of them have an ingredients list in Japanese.
I created a validation dataset from OFF texts, off_validation_dataset.csv (42 languages, 15-30 texts per language), and validated the fastText and lingua models.
I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (this and this). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.)
Accuracy of the models: fasttext: 92.94%, lingua: 93.79%. (I used only these models because, according to some articles (this and this) comparing language identification models, there's almost nothing better than fasttext.)
Should I compare their accuracy on only short texts, or should I try to retrain fasttext? @raphael0202 @jeremyarancio
Hello @korablique, thank you for the analysis!
So if I understood correctly, the `lang` field was obtained by querying Deepl and two other models, or by checking manually?
And can you provide the metrics for each language?
For reference, using duckdb, I computed the number of items for each language:

```
lang  count
fi    30
nl    30
pl    30
hr    30
pt    30
es    30
en    30
de    30
fr    30
it    30
cs    30
sv    29
da    29
he    29
nb    29
sl    28
et    28
lv    28
bg    28
ja    28
tr    27
hu    27
ru    26
vi    26
zh    25
is    25
th    24
no    24
ro    24
sr    24
uk    23
ko    22
ar    22
sk    22
lt    21
ka    17
el    17
bn    17
ca    17
bs    16
sq    15
id    15
```
(42 rows)
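For anyone reproducing this, a query along these lines gives those counts (the `lang` column name is an assumption about the validation CSV schema):

```python
# Sketch: count validation texts per language with duckdb
import duckdb

duckdb.sql("""
    SELECT lang, count(*) AS count
    FROM 'off_validation_dataset.csv'
    GROUP BY lang
    ORDER BY count DESC
""").show()
```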
I've just added to the Python SDK a new method to analyze the ingredients in a given language: https://openfoodfacts.github.io/openfoodfacts-python/usage/#perform-ingredient-analysis
Using the `is_in_taxonomy` field for each detected ingredient, you can easily compute the number of ingredients recognized or not, and spot ingredient lists that are not in the right language. It can help you detect errors in your validation set or increase its size.
edit: you need the latest version of the SDK for it to work, openfoodfacts==2.1.0
Good job @korablique! Since the distribution is not uniform, it would be preferable to compute the Precision & Recall for each lang, to have a better understanding of which languages the models struggle with. Also, based on the initial issue description, it seems the language prediction is often wrong when the text is quite short. Having Precision and Recall depending on the text length (<10 words, 10-20 words, >20 words, for example) could be insightful.
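A sketch of that evaluation (the column names and the `detect_lang` wrapper are assumptions, not existing code):

```python
# Sketch: per-language precision/recall/F1, overall and split by text length buckets.
import pandas as pd
from sklearn.metrics import classification_report

df = pd.read_csv("off_validation_dataset.csv")
df["pred"] = df["ingredients_text"].apply(detect_lang)  # detect_lang: fastText or lingua wrapper (hypothetical)

# Per-language precision / recall / F1
print(classification_report(df["lang"], df["pred"], zero_division=0))

# Same report, split by text length
buckets = pd.cut(
    df["ingredients_text"].str.split().str.len(),
    bins=[0, 10, 20, float("inf")],
    labels=["<10 words", "10-20 words", ">20 words"],
)
for bucket, group in df.groupby(buckets, observed=True):
    print(bucket)
    print(classification_report(group["lang"], group["pred"], zero_division=0))
```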
So if I understood correctly, the `lang` field was obtained by querying Deepl and two other models, or checking manually?
Yes
And can you provide the metrics for each language?
It seems like good results, congrats! If I may suggest some ways of improvement:
I would suggest also adding f1-score as a metric!
Recalculated the metrics on short texts only (no more than 10 words), 30 texts per language.
@korablique Can you publish the source code and your results in this repo? In a new `langid` folder.
Yes, I remember. I am preparing the code. I haven't published it yet because of the problem with the huggingface dataset. I plan to publish the code this week.
Problem
We're currently using fasttext for language identification. This is useful especially to detect the language of an ingredient list extracted automatically using a ML model, or added by a contributor.
However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).
Sometimes the model fails for obvious cases, such as this one (French ingredient list):
This behaviour is mostly present for short ingredient lists.
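For context, this is roughly how detection with the off-the-shelf pretrained model is done (a sketch; lid.176.bin is the public fastText language-ID model, and the example text is made up):

```python
# Sketch: detect the language of a short ingredient list with the pretrained
# fastText language-ID model. Short lists like this are where mispredictions are most common.
import fasttext

model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("Farine de blé, sucre, sel", k=3)  # top-3 predictions
for label, prob in zip(labels, probs):
    print(label.removeprefix("__label__"), round(float(prob), 3))
```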
We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).
Requirements
Using fasttext is not a requirement. We can either train a new fasttext model, or train a model with pytorch/tensorflow and export it to ONNX format.
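As a sketch of the fastText option (file names and hyperparameters are placeholders, not a decided setup), the training data would be one ingredient list per line prefixed with a `__label__{lang}` tag:

```python
# Sketch: train a fastText supervised classifier for language ID on OFF ingredient lists.
# Input format, one example per line:
#   __label__fr farine de blé, sucre, sel, levure
# File names and hyperparameters below are placeholders.
import fasttext

model = fasttext.train_supervised(
    input="ingredients_train.txt",
    epoch=25,
    lr=0.5,
    wordNgrams=2,
    minn=2,
    maxn=5,  # character n-grams help on short, noisy text
)
print(model.test("ingredients_valid.txt"))  # (n_samples, precision@1, recall@1)
model.save_model("lang_id_ingredients.bin")
```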