mrdbourke / nutrify

Take a photo of food and learn about it.
https://nutrify.app
MIT License
181 stars, 34 forks

Autolabelling: Set up a data collection pipeline (e.g. what happens when new data comes in?) #64

Open mrdbourke opened 1 year ago

mrdbourke commented 1 year ago

Data collection pipeline should be reactive to data coming into a bucket, for example:

Images get added to bucket -> autolabelling pipeline runs on unlabelled images -> label cleaning happens -> model training pipeline runs once all images are labelled -> evaluation pipeline runs -> deployment happens
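The stage ordering above could be sketched as a small state check that runs whenever new files land in the bucket. This is only an illustration: `ImageRecord`, `next_stage`, and `on_new_images` are hypothetical names, and the real trigger would depend on the storage provider in use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageRecord:
    """One image in the bucket (hypothetical schema, not from the repo)."""
    path: str
    label: Optional[str] = None  # None means the image is still unlabelled
    cleaned: bool = False        # True once the label has been reviewed/cleaned


def next_stage(records: List[ImageRecord]) -> str:
    """Return the name of the stage that should run for the current bucket state."""
    if not records:
        return "wait"
    if any(r.label is None for r in records):
        return "autolabel"
    if any(not r.cleaned for r in records):
        return "clean_labels"
    # Everything is labelled and cleaned, so training can start;
    # evaluation and deployment would follow a successful training run.
    return "train"


def on_new_images(records: List[ImageRecord], new_paths: List[str]) -> str:
    """Hypothetical handler called when new files are added to the bucket."""
    records.extend(ImageRecord(path=p) for p in new_paths)
    return next_stage(records)
```

Keeping the decision in one pure function means the same check can be called from a bucket notification, a cron job, or a manual run.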

See:

mrdbourke commented 1 year ago

Potential autolabelling pipeline:

Could use the pipeline above with multiple variants of CLIP-style models for redundancy.
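One way to get redundancy from several CLIP-style models is to accept a label only when enough of them agree, and send the rest for manual review. A minimal sketch, assuming each model has already produced a single predicted label per image; `consensus_label`, the model names, and the 0.6 threshold are illustrative choices, not something from the repo.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_label(predictions: Dict[str, str],
                    min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority label if enough models agree, else None.

    `predictions` maps a model name to the label it predicted for one image.
    A return value of None means the image should go to manual review.
    """
    if not predictions:
        return None
    label, votes = Counter(predictions.values()).most_common(1)[0]
    if votes / len(predictions) >= min_agreement:
        return label
    return None


# Hypothetical model names; two of three agree, so the label is accepted.
preds = {"clip_a": "banana", "clip_b": "banana", "clip_c": "plantain"}
accepted = consensus_label(preds)
```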

mrdbourke commented 1 year ago

See openclip for zero-shot classification: https://github.com/mlfoundations/open_clip

Also see clip-retrieval for embedding and searching a large existing dataset to find images specific to a certain task: https://github.com/rom1504/clip-retrieval

Can download a large number of images from web links using: https://github.com/rom1504/img2dataset

mrdbourke commented 1 year ago

It is much better to compute image embeddings and class embeddings up front, then reuse them over time where necessary.
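A minimal sketch of the compute-once, reuse-later idea, independent of whichever tooling ends up being used. `EmbeddingCache` and `classify` are hypothetical helpers, and `embed_fn` stands in for a real CLIP-style image or text encoder; the toy vectors below are made up purely to show the flow.

```python
import math
from typing import Callable, Dict, List, Sequence


def normalise(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


class EmbeddingCache:
    """Computes an embedding once per key and reuses it afterwards."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self._embed_fn = embed_fn  # assumed to wrap a real encoder in practice
        self._store: Dict[str, List[float]] = {}
        self.calls = 0  # how many times embed_fn was actually invoked

    def get(self, key: str) -> List[float]:
        if key not in self._store:
            self._store[key] = normalise(self._embed_fn(key))
            self.calls += 1
        return self._store[key]


def classify(image_vec: Sequence[float],
             class_vecs: Dict[str, Sequence[float]]) -> str:
    """Pick the class whose embedding has the highest cosine similarity."""
    def similarity(name: str) -> float:
        return sum(a * b for a, b in zip(image_vec, class_vecs[name]))
    return max(class_vecs, key=similarity)


# Toy stand-in for an encoder: fixed 2-D vectors keyed by image path or class name.
toy_vectors = {"img1.jpg": [0.9, 0.1], "apple": [1.0, 0.0], "banana": [0.0, 1.0]}
cache = EmbeddingCache(lambda key: toy_vectors[key])

class_vecs = {name: cache.get(name) for name in ("apple", "banana")}
prediction = classify(cache.get("img1.jpg"), class_vecs)
cache.get("img1.jpg")  # second lookup is served from the cache
```

In a real setup the in-memory dictionary would likely be replaced by something persistent (for example, files on disk keyed by image hash) so embeddings survive between runs.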

This could be set up via:

mrdbourke commented 1 year ago

See this resource for autolabelling object detection: https://github.com/facebookresearch/CutLER