rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.6k stars, 334 forks

Refactor as a (self hosted) service #339

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

https://github.com/rom1504/img2dataset/tree/streaming_refacto has some work I started on this about 8 months ago. I still think it's the right direction.

[image: Screenshot_20230820_233013]

may try to finish it soon

would close #82, #188 and #135

rom1504 commented 1 year ago

in terms of implementation, maybe Ray (e.g. #272) can help guide things and serve as a comparison with the HTTP path

rom1504 commented 11 months ago

https://github.com/ml6team/fondant is doing some good things in terms of packaging, a bit similar to what Jina is doing. They're using Docker though, and it's not clear how that can be made to work.

looks like they're becoming dependent on Dask though (https://github.com/ml6team/fondant/blob/main/components/load_from_parquet/src/main.py), which is what I'd like to avoid: being dependent on any given distribution framework

https://github.com/ml6team/fondant/blob/main/src/fondant/component.py

they're also locking themselves into kubeflow
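
For illustration, one way to avoid being tied to a single framework (Dask, Ray, Kubeflow) is to hide the framework behind a small interface and keep the per-shard work as a plain function. This is only a minimal sketch under assumptions: `Distributor`, `SequentialDistributor`, `ThreadDistributor`, `process_shard` and `run` are hypothetical names, not part of img2dataset or fondant.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Protocol, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Distributor(Protocol):
    """Hypothetical interface: the only thing the pipeline knows about."""

    def map(self, fn: Callable[[T], R], items: Iterable[T]) -> Iterator[R]:
        ...


class SequentialDistributor:
    """Runs everything in the calling process; handy for debugging."""

    def map(self, fn, items):
        return (fn(item) for item in items)


class ThreadDistributor:
    """Runs work on a local thread pool; results keep input order."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def map(self, fn, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            yield from pool.map(fn, items)


def process_shard(shard_id: int) -> str:
    # Placeholder for the real work (download, resize, write a shard).
    return f"shard-{shard_id:05d}"


def run(distributor: Distributor, shard_ids: Iterable[int]) -> list:
    # The pipeline depends only on the interface, never on a framework.
    return list(distributor.map(process_shard, shard_ids))


if __name__ == "__main__":
    print(run(SequentialDistributor(), range(3)))
    print(run(ThreadDistributor(workers=2), range(3)))
```

A Ray-, Dask- or HTTP-backed implementation would then be one more class with the same `map` method, and could be swapped in without touching `process_shard`.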

things to consider taking from them in terms of design: