Open raphael0202 opened 2 months ago
What I'm going to try (following the discussion on Slack):
@raphael0202 If that's OK with you, could you please assign the issue to me?
Yes, that's a good plan to start :+1: I've assigned you to this issue
Hey, is it a requirement that the model needs to run locally (not on a server)? It seems that LLMs are good at detecting languages.
Tried to use clustering to fix mislabeled data.
I took the languages for which there are at least 100 texts (37 languages), then took 100 texts per language and used them as a training dataset (the plan was to then get predictions for the entire dataset).
The texts were converted to embeddings with fastText (the get_sentence_vector method), and the dimensionality was reduced from 256 to 66 with PCA to preserve 95% of the variance. I tried two methods: Gaussian mixture and HDBSCAN. The Gaussian mixture divides the data into only 3 clusters, and HDBSCAN classifies all new data as noise. The picture below shows the result of HDBSCAN clustering on the training data. The clusters are difficult to separate.
Either clustering is not suitable for this task, or I am doing something wrong.
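For reference, the pipeline looks roughly like this (a sketch only: the fastText model path, the clustering parameters and the sample handling are assumptions, not the exact code used):

```python
# Sketch: fastText sentence embeddings -> PCA keeping 95% of the variance
# -> Gaussian mixture / HDBSCAN clustering.
import fasttext
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3; the standalone hdbscan package also works

ft_model = fasttext.load_model("lid.176.bin")  # placeholder: any fastText model exposing get_sentence_vector

def cluster_texts(texts: list[str], n_languages: int):
    # get_sentence_vector does not accept newline characters, so strip them first
    X = np.array([ft_model.get_sentence_vector(t.replace("\n", " ")) for t in texts])
    X = PCA(n_components=0.95).fit_transform(X)  # keep enough components for 95% variance
    gmm_labels = GaussianMixture(n_components=n_languages, random_state=0).fit_predict(X)
    hdbscan_labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
    return gmm_labels, hdbscan_labels
```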
Now I will try another language identification model, lingua (https://github.com/pemistahl/lingua-py), to compare the predictions and confidences of the two models. Then I'll take the data on which the models' predictions coincide and both models are confident, and fine-tune one of them on this data.
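A rough sketch of that comparison (assuming the pretrained lid.176.bin fastText model and the lingua-py >= 1.3 API; the confidence threshold is arbitrary):

```python
# Sketch: keep only texts for which fastText and lingua agree on the language
# and both are confident. Model path and threshold are placeholders.
import fasttext
from lingua import LanguageDetectorBuilder

ft_model = fasttext.load_model("lid.176.bin")
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

def agree_with_high_confidence(text: str, threshold: float = 0.9) -> bool:
    labels, probs = ft_model.predict(text.replace("\n", " "))
    ft_lang, ft_conf = labels[0].removeprefix("__label__"), float(probs[0])

    # lingua returns confidence values sorted from highest to lowest
    best = lingua_detector.compute_language_confidence_values(text)[0]
    lingua_lang, lingua_conf = best.language.iso_code_639_1.name.lower(), best.value

    return ft_lang == lingua_lang and ft_conf >= threshold and lingua_conf >= threshold
```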
Here's a really nice article summarizing different approaches for language detection, from statistical to deep learning https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad
Would be great to have a validation dataset to estimate the performance of any solution. This dataset can be manually annotated using https://languagetool.org/
How I got the distribution of text languages:
1. selected `ingredients_text_{LANG}` field names from MongoDB:
```sh
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var cursor = db.mycollection.aggregate([
    { "$project": {
        "fields": { "$objectToArray": "$$ROOT" }
    }},
    { "$unwind": "$fields" },
    { "$match": { "fields.k": /^ingredients_text_/ }},
    { "$group": {
        "_id": null,
        "all_fields": { "$addToSet": "$fields.k" }
    }},
    { "$limit": 20 }
]);
if (cursor.hasNext()) {
    printjson(cursor.next().all_fields);
}' > field_names.json
```
2. then selected the field values:
```sh
FIELDS=$(jq -r '.[]' field_names.json | paste -sd "," -)
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var fields = "'$FIELDS'".split(",");
var projection = {};
fields.forEach(function(field) { projection[field] = 1; });
db.mycollection.find({}, projection).forEach(function(doc) {
    var cleanedDoc = {};
    fields.forEach(function(field) {
        if (doc[field] && doc[field] !== "") { cleanedDoc[field] = doc[field]; }
    });
    if (Object.keys(cleanedDoc).length > 0) { printjson(cleanedDoc); }
});' > filtered_extracted_values.json
```
(but after that there are still some extra fields left, e.g. `ingredients_text_with_allergens`)
3. then I made a dictionary in which the text is the key and the language is the value:

```python
import os
import ijson

ingredients_text_lang_dct = dict()
with open(os.path.join(data_dir, 'filtered_extracted_values.json'), 'r') as data_file:
    for dct in ijson.items(data_file, 'item'):
        for k, v in dct.items():
            if k == 'ingredients_text_with_allergens':
                continue
            lang = k[k.rfind('_') + 1:]
            # if the field is `ingredients_text_{LANG}_imported`
            if lang == 'imported':
                start = k[:k.rfind('_')].rfind('_') + 1
                end = k.rfind('_')
                lang = k[start:end]
            ingredients_text_lang_dct.update({v: lang})
```
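The language distribution shown later in the thread can then be obtained by counting the values of this dictionary, for example:

```python
# Count how many texts were collected per language
# (ingredients_text_lang_dct maps text -> language code, as built above)
from collections import Counter

lang_counts = Counter(ingredients_text_lang_dct.values())
for lang, count in lang_counts.most_common():
    print(lang, count)
```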
@raphael0202
Would be great to have a validation dataset to estimate the performance of any solution. This dataset can be manually annotated using https://languagetool.org/
How many samples should it contain? Should I select an equal number of samples for each language or just random? @jeremyarancio
Roughly 30 labels per language to start with, I would say. It's just to have an idea of the performance.
Here is the number of texts for each language:

```
en 422020  fr 299681  de 89880  es 46255  it 31801  nl 19983  pl 8401  pt 8119  sv 6128  bg 4453
ro 3771  fi 3726  ru 3610  nb 3591  cs 3500  th 3157  da 2021  hr 2015  hu 1962  ar 1104
el 943  ja 912  ca 824  sr 735  sl 727  sk 606  tr 506  lt 453  zh 436  et 370
lv 333  xx 318  no 315  uk 274  id 262  he 209  vi 121  is 113  la 89  in 72
ko 71  sq 70  iw 59  ka 54  ms 52  bs 37  fa 35  bn 33  gl 32  kk 25
mk 23  nn 18  hi 18  aa 17  uz 17  so 15  af 12  eu 11  az 8  be 7
cy 7  hy 7  tt 6  ku 5  km 4  te 4  ky 4  ur 4  mg 3  ty 3
ta 3  tg 3  my 3  tl 3  mo 2  sc 2  ir 2  ne 2  tk 2  am 2
mn 2  co 2  se 2  si 2  fj 1  ch 1  ug 1  yi 1  to 1  fo 1
mt 1  ht 1  ak 1  jp 1  oc 1  lb 1  mi 1  as 1  yo 1  ga 1
gd 1  ba 1  zu 1  mr 1
```
Would it be possible to share the link to this original data set? I am curious to have a look at it as well. Thanks!
I used the MongoDB dump. I described above how I retrieved the data from it. However, there might be an error in my script because some languages have fewer texts than expected (e.g. I got 912 samples of Japanese texts, but on https://jp-en.openfoodfacts.org/ there are around 16,000).
Please keep me posted if you're planning to work on this task, as I'm actively working on it. You can find me on OFF slack (Yulia Zhilyaeva).
If this can help, there's now a Parquet dump on Hugging Face, which is the JSONL dump processed and cleaned of irrelevant features:
https://huggingface.co/datasets/openfoodfacts/product-database
Tried to retrieve data from the huggingface dataset, but I still get ~900 samples of Japanese texts, and ~996,000 texts in total. Am I doing something wrong? Or is it because the hf dataset currently stores text only in the original language? My code:
```python
import os

import pandas as pd
from datasets import load_dataset
from tqdm import tqdm

other_lang_columns = [
    'ingredients_text_fr',
    'ingredients_text_en',
    ...
]
dataset_file = os.path.join(data_dir, 'data_from_hf.csv')

for start, stop in tqdm(zip(range(0, 91, 10), range(10, 101, 10))):
    # read 10% of the dataset
    hf_dataset = load_dataset('openfoodfacts/product-database', split=f'main[{start}%:{stop}%]')

    # retrieve ingredients_text and lang
    ingredients_texts = hf_dataset['ingredients_text']
    langs = hf_dataset['lang']
    df = pd.DataFrame({'ingredients_text': ingredients_texts, 'lang': langs})
    df.dropna(inplace=True)

    # retrieve ingredients_text_{LANG}
    for other_lang_col in other_lang_columns:
        lang = other_lang_col[-2:]
        other_lang_texts = hf_dataset[other_lang_col]
        other_lang_texts = [text for text in other_lang_texts if text is not None and len(text) > 0]
        new_rows = pd.DataFrame({'ingredients_text': other_lang_texts, 'lang': [lang] * len(other_lang_texts)})
        df = pd.concat((df, new_rows), ignore_index=True)

    # save
    df.to_csv(dataset_file, mode='a', header=start == 0, index=False)
```
The Parquet contains the same information as the JSONL file, so it's not surprising.
You also have the text in all languages as `ingredients_text` and `ingredients_text_{lang}`.
I see. I mean I don't understand why there are 16,000 products on https://jp-en.openfoodfacts.org/ while I have only 900 @jeremyarancio
Oh, it seems that just not all of them have an ingredients list in Japanese.
I created a validation dataset from OFF texts, off_validation_dataset.csv (42 languages, 15-30 texts per language), and validated the fastText and lingua models.
I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (this and this). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.)
Accuracy of the models: fasttext: 92.94%, lingua: 93.79%. (I used only these models because, according to some articles (this and this) comparing language identification models, there's almost nothing better than fasttext.)
Should I compare their accuracy on only short texts, or should I try to retrain fasttext? @raphael0202 @jeremyarancio
Hello @korablique, thank you for the analysis!
So if I understood correctly, the `lang` field was obtained by querying Deepl and two other models, or by checking manually?
And can you provide the metrics for each language?
For reference, using duckdb, I computed the number of items for each language:

```
lang  count
fi    30
nl    30
pl    30
hr    30
pt    30
es    30
en    30
de    30
fr    30
it    30
cs    30
sv    29
da    29
he    29
nb    29
sl    28
et    28
lv    28
bg    28
ja    28
tr    27
hu    27
ru    26
vi    26
zh    25
is    25
th    24
no    24
ro    24
sr    24
uk    23
ko    22
ar    22
sk    22
lt    21
ka    17
el    17
bn    17
ca    17
bs    16
sq    15
id    15
```
(42 rows)
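For anyone reproducing this, a query along these lines gives those counts (the `lang` column name is an assumption about the validation CSV schema):

```python
# Sketch: count validation texts per language with duckdb
import duckdb

duckdb.sql("""
    SELECT lang, count(*) AS count
    FROM 'off_validation_dataset.csv'
    GROUP BY lang
    ORDER BY count DESC
""").show()
```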
I've just added to the Python SDK a new method to analyze the ingredients in a given language: https://openfoodfacts.github.io/openfoodfacts-python/usage/#perform-ingredient-analysis
Using the `is_in_taxonomy` field for each detected ingredient, you can easily compute the number of ingredients recognized or not, and spot ingredient lists that are not in the right language. It can help you detect errors in your validation set or increase its size.
edit: you need the latest version of the SDK for it to work, openfoodfacts==2.1.0
Good job @korablique! Since the distribution is not uniform, it would be preferable to compute the Precision & Recall for each lang, to have a better understanding of which languages the models struggle with. Also, based on the initial issue description, it seems the language prediction is often wrong when the text is quite short. Having Precision and Recall depending on the text length (<10 words, 10-20 words, >20 words, for example) could be insightful.
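A sketch of that evaluation (the column names and the `detect_lang` wrapper are assumptions, not existing code):

```python
# Sketch: per-language precision/recall/F1, overall and split by text length buckets.
import pandas as pd
from sklearn.metrics import classification_report

df = pd.read_csv("off_validation_dataset.csv")
df["pred"] = df["ingredients_text"].apply(detect_lang)  # detect_lang: fastText or lingua wrapper (hypothetical)

# Per-language precision / recall / F1
print(classification_report(df["lang"], df["pred"], zero_division=0))

# Same report, split by text length
buckets = pd.cut(
    df["ingredients_text"].str.split().str.len(),
    bins=[0, 10, 20, float("inf")],
    labels=["<10 words", "10-20 words", ">20 words"],
)
for bucket, group in df.groupby(buckets, observed=True):
    print(bucket)
    print(classification_report(group["lang"], group["pred"], zero_division=0))
```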
So if I understood correctly, the `lang` field was obtained by querying Deepl and two other models, or checking manually?
Yes
And can you provide the metrics for each language?
It seems like good results, congrats! If I may suggest some ways of improvement:
I would suggest also adding f1-score as a metric!
Recalculated the metrics on short texts only (no more than 10 words), 30 texts per language.
@korablique Can you publish the source code and your results in this repo? In a new `langid` folder.
Yes, I remember. I am preparing the code. I haven't published it yet because of the problem with the huggingface dataset. I plan to publish the code this week.
Problem
We're currently using fasttext for language identification. This is useful especially to detect the language of an ingredient list extracted automatically using a ML model, or added by a contributor.
However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).
Sometimes the model fails for obvious cases, such as this one (French ingredient list):
This behaviour is mostly present for short ingredient lists.
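For context, this is roughly how detection with the off-the-shelf pretrained model is done (a sketch; lid.176.bin is the public fastText language-ID model, and the example text is made up):

```python
# Sketch: detect the language of a short ingredient list with the pretrained
# fastText language-ID model. Short lists like this are where mispredictions are most common.
import fasttext

model = fasttext.load_model("lid.176.bin")
labels, probs = model.predict("Farine de blé, sucre, sel", k=3)  # top-3 predictions
for label, prob in zip(labels, probs):
    print(label.removeprefix("__label__"), round(float(prob), 3))
```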
We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).
Requirements
Using fasttext is not a requirement. We can either train a new fasttext model, or train a model with pytorch/tensorflow and export it to ONNX format.
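As a sketch of the fastText option (file names and hyperparameters are placeholders, not a decided setup), the training data would be one ingredient list per line prefixed with a `__label__{lang}` tag:

```python
# Sketch: train a fastText supervised classifier for language ID on OFF ingredient lists.
# Input format, one example per line:
#   __label__fr farine de blé, sucre, sel, levure
# File names and hyperparameters below are placeholders.
import fasttext

model = fasttext.train_supervised(
    input="ingredients_train.txt",
    epoch=25,
    lr=0.5,
    wordNgrams=2,
    minn=2,
    maxn=5,  # character n-grams help on short, noisy text
)
print(model.test("ingredients_valid.txt"))  # (n_samples, precision@1, recall@1)
model.save_model("lang_id_ingredients.bin")
```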