Open mrdbourke opened 1 year ago
Potential autolabelling pipeline:
Could use the pipeline above with multiple variants of CLIP-style models for redundancy.
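One way to get redundancy from several CLIP-style models is to keep an automatic label only when enough of them agree, and send the rest to manual review. The sketch below assumes each model has already produced one predicted label per image. The model names, the `consensus_label` helper and the `min_agreement` threshold are all hypothetical and not taken from any library.

```python
from collections import Counter


def consensus_label(predictions, min_agreement=2):
    """Return the label most models agree on, or None if agreement is too weak.

    predictions: dict mapping a (hypothetical) model name to its predicted label.
    min_agreement: minimum number of models that must vote for the same label.
    """
    if not predictions:
        return None
    # Look at the two most common labels so ties can be detected.
    counts = Counter(predictions.values()).most_common(2)
    top_label, top_votes = counts[0]
    # A tie between the top two labels means no clear consensus.
    if len(counts) > 1 and counts[1][1] == top_votes:
        return None
    return top_label if top_votes >= min_agreement else None


# Hypothetical predictions from three model variants for one image.
preds = {"clip_vit_b32": "pizza", "clip_vit_l14": "pizza", "clip_convnext": "steak"}
print(consensus_label(preds))                   # pizza
print(consensus_label(preds, min_agreement=3))  # None
```

Images that come back as `None` could be queued for the label cleaning step instead of being labelled automatically.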
See OpenCLIP for zero-shot classification: https://github.com/mlfoundations/open_clip
Also see clip-retrieval for embedding and searching a large existing dataset for images specific to a certain task: https://github.com/rom1504/clip-retrieval
A large number of images can be downloaded from web links using img2dataset: https://github.com/rom1504/img2dataset
It is much better to compute the image embeddings and class embeddings up front, then reuse them over time where necessary.
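A minimal sketch of the reuse idea, assuming the embeddings have already been produced by some CLIP-style model and are held in plain dictionaries. The toy 3-dimensional vectors, the file names and the `normalize` and `zero_shot_label` helpers are illustrative assumptions; real embeddings would come from a model and have hundreds of dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_label(image_unit, class_units):
    """Return the class whose cached embedding is most similar to the image."""
    scores = {
        name: sum(a * b for a, b in zip(image_unit, vec))
        for name, vec in class_units.items()
    }
    return max(scores, key=scores.get)


# Computed once up front (toy values standing in for real model outputs).
image_cache = {
    "img_001.jpg": normalize([0.9, 0.1, 0.0]),
    "img_002.jpg": normalize([0.2, 0.8, 0.1]),
}
class_cache = {
    "pizza": normalize([1.0, 0.0, 0.0]),
    "sushi": normalize([0.0, 1.0, 0.0]),
}

# Labelling reuses both caches and never touches the model again.
labels = {path: zero_shot_label(emb, class_cache) for path, emb in image_cache.items()}
print(labels)  # {'img_001.jpg': 'pizza', 'img_002.jpg': 'sushi'}
```

With this layout, adding a new class only requires embedding the new class text and adding it to `class_cache`; the stored image embeddings stay as they are.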
This could be set up via:
For autolabelling object detection data, see CutLER: https://github.com/facebookresearch/CutLER
The data collection pipeline should react to data arriving in a bucket, for example:
Images get added to bucket -> autolabelling pipeline runs on unlabelled images -> label cleaning happens -> model training pipeline runs once all images are labelled -> evaluation pipeline runs -> deployment happens
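The flow above can be sketched as a function that inspects the current state of the bucket and decides which stage should run next. This is only an illustration of the ordering: the `ImageRecord` fields, the stage names and the `next_stage` helper are hypothetical, and a real setup would likely be triggered by the storage provider's bucket events rather than by polling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageRecord:
    """Hypothetical per-image state tracked alongside the bucket."""
    path: str
    label: Optional[str] = None  # None means the image is still unlabelled
    cleaned: bool = False        # True once the label has been reviewed


def next_stage(records, model_trained=False, model_evaluated=False):
    """Return the name of the next pipeline stage for the current bucket state."""
    if not records:
        return "wait_for_images"
    if any(r.label is None for r in records):
        return "autolabel"
    if any(not r.cleaned for r in records):
        return "clean_labels"
    # Training only starts once every image is labelled and cleaned.
    if not model_trained:
        return "train"
    if not model_evaluated:
        return "evaluate"
    return "deploy"


bucket = [ImageRecord("a.jpg", label="pizza", cleaned=True), ImageRecord("b.jpg")]
print(next_stage(bucket))  # autolabel
```

Calling `next_stage` each time an image lands in the bucket keeps the stages in the order listed above, since a new unlabelled image always sends the pipeline back to autolabelling.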
See: