rom1504 opened this issue 3 years ago (status: Open)
Could even consider adding optional stats computation, and even training, at the end: a full dataset construction pipeline starting from URLs. Important point:
The result will look like this: https://colab.research.google.com/github/criteo/autofaiss/blob/master/docs/notebooks/autofaiss_multimodal_search.ipynb but better packaged.
https://github.com/rom1504/img2dataset#api
The config is important. Figure out a way to expose all options while still doing the right thing by default, with as few required arguments as possible.
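One way to get "all options exposed, few required arguments" is a dataclass where only the URL list and output path lack defaults, plus optional overrides from a JSON file or keyword arguments. This is only a sketch: `End2EndConfig`, `load_config`, and every option name and default value below are hypothetical, not part of clip-retrieval or img2dataset.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class End2EndConfig:
    # The only two required arguments.
    url_list: str
    output_path: str
    # Hypothetical optional settings; names and defaults are illustrative only.
    image_size: int = 256
    clip_model: str = "ViT-B/32"
    batch_size: int = 256
    run_back_and_front: bool = True


def load_config(url_list: str, output_path: str,
                config_path: Optional[str] = None, **overrides) -> End2EndConfig:
    """Build a config from defaults, then a JSON file, then keyword overrides."""
    options = {}
    if config_path is not None:
        with open(config_path) as f:
            options.update(json.load(f))
    options.update(overrides)  # keyword overrides win over the JSON file

    # Reject unknown keys early so typos do not silently fall back to defaults.
    optional = {f.name for f in fields(End2EndConfig)} - {"url_list", "output_path"}
    unknown = set(options) - optional
    if unknown:
        raise ValueError(f"Unknown config options: {sorted(unknown)}")
    return End2EndConfig(url_list=url_list, output_path=output_path, **options)
```

With this shape, `load_config("urls.txt", "out")` is enough for the default behaviour, while a JSON file or keyword arguments can change any single option.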
It should be possible to go from a URL list to an index + metadata store in one step, at almost no memory usage, when
It would look like this:
clip-retrieval end2end <url list> <output path>
For laion400m, that would use:
An interesting possibility, but it might not be that important compared to working incrementally.
The basics are now done.
next:
config modes:
Each one will be useful, either for maximum convenience or for maximum configurability.
Also consider making end2end an example and letting people write their preferred config in Python.
Url list -> filtering (dedup) -> downloading -> clip inference -> indexing -> back + front (subprocess or host with back too)
clip-retrieval end2end <url list> <config.json>
It would start a Prefect UI showing what's going on, with wandb links for each subtask. Then, after a short while, it would start the back and front and open the demo in the browser.
Build it with Prefect and use a good config framework (fromconfig?).
Ideally it would be incremental and schedulable too. Making it distributed could also be interesting, but is not necessary.
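The stage chain described above (URL list -> filtering (dedup) -> downloading -> CLIP inference -> indexing) can be sketched as a chain of generators, so only one batch is held in memory at a time. This is a rough illustration: `fake_download` and `fake_embed` are placeholders standing in for the real download and CLIP inference steps, and all function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield stripped, non-empty URLs one at a time."""
    for line in lines:
        url = line.strip()
        if url:
            yield url


def dedup(urls: Iterable[str]) -> Iterator[str]:
    """Drop repeated URLs. The seen-set grows with the number of unique URLs."""
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a stream into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_download(url: str) -> bytes:
    # Placeholder: a real pipeline would fetch the image bytes here.
    return url.encode("utf-8")


def fake_embed(images: List[bytes]) -> List[List[float]]:
    # Placeholder: a real pipeline would run CLIP inference on the batch.
    return [[float(len(image))] for image in images]


def run_pipeline(lines: Iterable[str], batch_size: int) -> Iterator[Tuple[str, List[float]]]:
    """Stream (url, embedding) pairs; the caller writes them to index + metadata."""
    for batch in batched(dedup(read_urls(lines)), batch_size):
        images = [fake_download(url) for url in batch]
        for url, vector in zip(batch, fake_embed(images)):
            yield url, vector
```

Because `run_pipeline` is itself a generator, the consumer can append each pair to an on-disk index and metadata store as it arrives instead of collecting everything first.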
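A minimal sketch of what "incremental" could mean here, assuming the set of already processed URLs is persisted in a small JSON state file between runs. The state file format and the names `load_done` and `run_incremental` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Set


def load_done(state_path: Path) -> Set[str]:
    """Return the URLs recorded by previous runs (empty on the first run)."""
    if state_path.exists():
        return set(json.loads(state_path.read_text()))
    return set()


def run_incremental(urls: Iterable[str], state_path: Path,
                    process: Callable[[str], None]) -> List[str]:
    """Process only URLs not seen in earlier runs, then update the state file."""
    done = load_done(state_path)
    newly_processed = []
    for url in urls:
        if url in done:
            continue  # already handled by a previous run or earlier in this one
        process(url)
        done.add(url)
        newly_processed.append(url)
    state_path.write_text(json.dumps(sorted(done)))
    return newly_processed
```

A scheduler could then call `run_incremental` periodically with the latest URL list, and each run would only pay for the new entries.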