openfoodfacts / off-category-classification

GNU Affero General Public License v3.0

feat: Update off_categories dataset to latest data #67

Closed streino closed 2 years ago

streino commented 2 years ago

Here are the results from a full run of Train.ipynb with the new dataset:

loss: 0.0010 - binary_accuracy: 0.9997 - precision: 0.8762 - recall: 0.7635 - 
val_loss: 0.0012 - val_binary_accuracy: 0.9997 - val_precision: 0.8754 - val_recall: 0.7581

Compared to current performance on master:

loss: 0.0012 - binary_accuracy: 0.9996 - precision: 0.8870 - recall: 0.7940 - 
val_loss: 0.0014 - val_binary_accuracy: 0.9996 - val_precision: 0.8899 - val_recall: 0.7895

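So performance is slightly lower but close: validation precision drops from 0.8899 to 0.8754 and validation recall from 0.7895 to 0.7581, while loss and binary accuracy barely move. We go from 3,969 to 5,205 categories, so a drop in precision and recall is not surprising.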
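To make the comparison concrete, here is a small sketch that computes the change in validation precision and recall between the two runs, plus an F1 score derived from them. The numbers are copied from the logs quoted above; F1 is not reported by the notebook and is only added here as a convenient single-number summary, and the variable and function names are illustrative.

```python
# Validation metrics copied from the two training logs quoted above.
master = {"precision": 0.8899, "recall": 0.7895}
new = {"precision": 0.8754, "recall": 0.7581}


def f1(precision, recall):
    """Harmonic mean of precision and recall (not reported in the logs)."""
    return 2 * precision * recall / (precision + recall)


# Change per metric, new run minus master (negative means a drop).
deltas = {name: round(new[name] - master[name], 4) for name in master}

f1_master = f1(**master)
f1_new = f1(**new)

print("deltas:", deltas)
print("F1 master: %.4f, F1 new: %.4f" % (f1_master, f1_new))
```

With these inputs, recall accounts for most of the drop (about 3 points, against about 1.5 points for precision), and the derived validation F1 goes from roughly 0.84 to roughly 0.81.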

Note: The new run uses a 5K limit on the 'ingredients_tags' vocabulary. The full vocabulary went from 4,222 tokens (current dataset, using 'known_ingredients_tags') to 49K tokens (new dataset, using 'ingredients_tags'), so we cap it at 5K.
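For readers unfamiliar with vocabulary capping, the idea can be sketched without any ML library: count how often each token occurs, keep only the most frequent ones, and map everything else to an out-of-vocabulary placeholder. This is an illustration only, not the notebook's actual code (which may well rely on a Keras lookup layer instead); the function names, the `[UNK]` placeholder, and the toy samples are assumptions made for the example.

```python
from collections import Counter


def build_capped_vocabulary(samples, max_tokens):
    """Return the max_tokens most frequent tokens across all samples."""
    counts = Counter(token for tags in samples for token in tags)
    return [token for token, _ in counts.most_common(max_tokens)]


def encode(tags, vocabulary, oov="[UNK]"):
    """Replace tokens outside the capped vocabulary with a placeholder."""
    known = set(vocabulary)
    return [tag if tag in known else oov for tag in tags]


# Toy stand-in for the 'ingredients_tags' column (hypothetical values).
samples = [
    ["en:sugar", "en:salt", "en:water"],
    ["en:sugar", "en:water"],
    ["en:sugar", "en:cocoa"],
]

# The run described above caps at 5,000; 2 keeps the toy example readable.
vocab = build_capped_vocabulary(samples, max_tokens=2)
encoded = encode(["en:sugar", "en:cocoa"], vocab)
print(vocab)
print(encoded)
```

The trade-off is that rare ingredients all collapse into the same placeholder, which keeps the input layer small at the cost of losing whatever signal those rare tokens carried.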