Autolabelling: Set up a data collection pipeline (e.g. what happens when new data comes in?)

mrdbourke / nutrify

Take a photo of food and learn about it.

https://nutrify.app

MIT License

181 stars, 34 forks

Autolabelling: Set up a data collection pipeline (e.g. what happens when new data comes in?) #64

Open mrdbourke opened 1 year ago

mrdbourke commented 1 year ago

Data collection pipeline should be reactive to data coming into a bucket, for example:

Images get added to bucket -> autolabelling pipeline runs on unlabelled images -> label cleaning happens -> model training pipeline runs once all images are labelled -> evaluation pipeline runs -> deployment happens

See:

Modal Cron jobs for watching a storage bucket for changes: https://modal.com/docs/guide/cron
Google/TensorFlow continuous adaptation for ML system to data changes (watch bucket, do X if something happens): https://blog.tensorflow.org/2021/12/continuous-adaptation-for-machine.html
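
A rough sketch of the "react to new data" logic, assuming the bucket is polled on a schedule (e.g. from a cron job) and each image carries a label plus a "label verified" flag. `ImageRecord`, `find_new_keys` and `next_stage` are hypothetical names for illustration, not existing Nutrify code, and the bucket listing itself is left out:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageRecord:
    key: str                      # object key in the bucket (hypothetical naming)
    label: Optional[str] = None   # None means the image is not labelled yet
    label_verified: bool = False  # True once label cleaning has reviewed it


def find_new_keys(current_keys, seen_keys):
    """Return keys in the current bucket listing that were not there on the last poll."""
    return sorted(set(current_keys) - set(seen_keys))


def next_stage(records):
    """Pick the next pipeline stage from the current state of the dataset."""
    if not records:
        return "wait"
    if any(r.label is None for r in records):
        return "autolabel"
    if any(not r.label_verified for r in records):
        return "clean_labels"
    # Everything is labelled and cleaned: training, then evaluation and deployment.
    return "train"
```

Each scheduled run would list the bucket, call `find_new_keys` to register new images, then call `next_stage` to decide what to kick off.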

mrdbourke commented 1 year ago

Potential autolabelling pipeline:

raw images downloaded (e.g. filtered images from a large dataset such as LAION-COCO)
several rounds of zero-shot classification are run to further filter images
- "edible_food" vs "other" (only keep images which contain edible food)
- "contains_logo" vs "other" (remove images with logos/text)
- "apple" vs "banana" ... (label images with their appropriate class name)

Could use the pipeline above with multiple variants of CLIP-style models for redundancy.
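
A minimal sketch of the filtering rounds above, assuming image and class (text) embeddings have already been produced by some CLIP-style model. Plain Python lists stand in for real embeddings, the toy vectors in `ROUNDS` are made up, and `consensus` is just one possible way to combine several models (keep a label only if every model agrees), not something decided yet:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot(image_emb, class_embs):
    """Return the class name whose embedding is most similar to the image."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))


def autolabel(image_emb, rounds):
    """Run the rounds in order; return the final label, or None if filtered out.

    rounds is a list of (class_embs, keep) pairs. keep is the set of winning
    class names that let the image continue, or None for the labelling round.
    """
    label = None
    for class_embs, keep in rounds:
        label = zero_shot(image_emb, class_embs)
        if keep is not None and label not in keep:
            return None
    return label


def consensus(labels):
    """Keep a label only if every model produced the same non-None label."""
    if labels and labels[0] is not None and len(set(labels)) == 1:
        return labels[0]
    return None


# Toy 3-D "embeddings" purely for illustration (not real CLIP outputs).
ROUNDS = [
    ({"edible_food": [1, 0, 0], "other": [-1, 0, 0]}, {"edible_food"}),
    ({"contains_logo": [0, 1, 0], "other": [0, -1, 0]}, {"other"}),
    ({"apple": [0, 0, 1], "banana": [0, 0, -1]}, None),
]
```

With real models, each CLIP variant would get its own `rounds` (its own class embeddings), and `consensus` would be called on the list of per-model results for the same image.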

mrdbourke commented 1 year ago

See OpenCLIP for zero-shot classification: https://github.com/mlfoundations/open_clip

Also see clip-retrieval for just embedding/searching a large existing dataset for images specific to a certain task: https://github.com/rom1504/clip-retrieval

Can download a large number of images from web links using: https://github.com/rom1504/img2dataset

mrdbourke commented 1 year ago

Much better to compute image embeddings + class embeddings up front.

Then reuse over time where necessary.

This could be set up via:

image gets given UUID
image embedding gets calculated
if the image UUID has an existing embedding, use that (can force to compute new if necessary)
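
The steps above could look something like this. It is only a sketch: `EmbeddingCache` and `new_image_id` are hypothetical names, the store is an in-memory dict (a real setup would likely persist to disk or a database), and `embed_fn` stands in for whatever model computes the embedding:

```python
import uuid


def new_image_id():
    """Give a new image a random UUID (as a string)."""
    return str(uuid.uuid4())


class EmbeddingCache:
    """Compute each embedding once and reuse it, keyed by image UUID."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # assumed: callable taking an image, returning an embedding
        self._store = {}          # image UUID -> embedding

    def get(self, image_id, image, force=False):
        """Return the stored embedding, computing it if missing or if force=True."""
        if force or image_id not in self._store:
            self._store[image_id] = self.embed_fn(image)
        return self._store[image_id]
```

The same idea would work for class embeddings, keyed by class name instead of UUID. `force=True` covers the case where the embedding model changes and old embeddings need recomputing.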

mrdbourke commented 1 year ago

See this resource for autolabelling object detection: https://github.com/facebookresearch/CutLER