rom1504 / img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
MIT License
3.62k stars 336 forks

provide more distributed strategies #135

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

for example

Follow-up to https://github.com/rom1504/img2dataset/issues/20

rom1504 commented 2 years ago

Based on new information I gathered, the more important goal here would be to make img2dataset as easy as possible to use in a swarm environment rather than a cluster: many different kinds of nodes connecting, helping out for a while, then disconnecting. This already kind of works thanks to Spark's dynamic allocation feature, but it could be better tested, better documented, and easier to run. Ideally it would even be possible to do this in a trustless fashion, but that would probably require a lot more engineering than the trustful-but-unreliable case.
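To make the dynamic allocation idea concrete, here is a rough sketch of what such a setup might look like. The property names are standard Spark settings and `distributor="pyspark"` is an existing img2dataset option, but the values, the app name, and the helper names `build_session` and `run_download` are assumptions for illustration, not a tested recipe.

```python
# Sketch only: standard Spark property names, but the values are guesses
# that would need tuning for a real swarm of come-and-go nodes.
DYNAMIC_ALLOCATION_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    # Lets dynamic allocation work without an external shuffle service (Spark 3.0+).
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "64",
    "spark.dynamicAllocation.executorIdleTimeout": "120s",
    # Tolerate more task failures than the default, since nodes may vanish mid-task.
    "spark.task.maxFailures": "8",
}


def build_session(app_name="img2dataset-swarm", conf=DYNAMIC_ALLOCATION_CONF):
    """Hypothetical helper: create a SparkSession with the settings above."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def run_download(url_list, output_folder):
    """Hypothetical helper: run img2dataset on top of the session created above."""
    from img2dataset import download  # imported lazily; requires img2dataset

    build_session()
    download(
        url_list=url_list,
        output_folder=output_folder,
        output_format="webdataset",
        distributor="pyspark",
    )
```

The master URL is deliberately left out here; it would normally come from `spark-submit` or the cluster manager.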

Being able to handle unreliable resources would unlock combining many different resources, rather than needing to allocate a lot of resources in a single place.
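One possible way to handle the trustful-but-unreliable case is to hand out shards under a time-limited lease and give a shard to someone else if the worker disappears. The sketch below is not how img2dataset works today; `ShardQueue` and all of its methods are hypothetical names, and a real version would need shared storage and networking instead of an in-memory object.

```python
import time


class ShardQueue:
    """In-memory sketch of lease-based shard assignment for unreliable workers."""

    def __init__(self, shard_ids, lease_seconds=60.0, clock=time.monotonic):
        self.pending = list(shard_ids)
        self.leases = {}  # shard_id -> (worker_id, deadline)
        self.done = set()
        self.lease_seconds = lease_seconds
        self.clock = clock

    def _reclaim_expired(self):
        # Put shards whose lease has expired back in the pending list.
        now = self.clock()
        for shard_id, (_, deadline) in list(self.leases.items()):
            if deadline <= now:
                del self.leases[shard_id]
                self.pending.append(shard_id)

    def claim(self, worker_id):
        """Give the next shard to a worker, or None if nothing is available."""
        self._reclaim_expired()
        if not self.pending:
            return None
        shard_id = self.pending.pop(0)
        self.leases[shard_id] = (worker_id, self.clock() + self.lease_seconds)
        return shard_id

    def complete(self, worker_id, shard_id):
        """Mark a shard done; rejected if the worker no longer holds the lease."""
        lease = self.leases.get(shard_id)
        if lease is None or lease[0] != worker_id:
            return False
        del self.leases[shard_id]
        self.done.add(shard_id)
        return True

    def finished(self):
        return not self.pending and not self.leases
```

A worker loop would call `claim`, download the shard, then call `complete`. If the node disconnects in between, the lease simply expires and another node picks the shard up. This only covers unreliability, not trust: a worker could still report a shard as done without doing the work.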

rom1504 commented 2 years ago

https://pytorch.org/tutorials/intermediate/dist_tuto.html