piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1
965 stars 128 forks

Data release torrents #47

Open jxu opened 3 years ago

jxu commented 3 years ago

Have you considered using torrents to get around filesize limits and improve download speed?

piskvorky commented 3 years ago

We haven't, but that's an interesting idea.

Somebody would need to pick it up and execute it though, with strong software engineering. It's not trivial, because we definitely want to keep the current "HTTP download from GitHub" (compatibility, universality). Torrents are more fringe, so we'd have to keep both approaches and sync them.

A lot of work IMO, for not much benefit. How much faster do you need the download to be? How often do you download?

jxu commented 3 years ago

I don't download that much actually and I actually have very fast internet. But it's an option if GitHub ever complains about how much bandwidth is used.

piskvorky commented 3 years ago

> I don't download that much actually and I actually have very fast internet.

Interesting. What motivated you to open this ticket, then? How did you think of it?

jxu commented 3 years ago

I was downloading a different dataset from a university and the bandwidth was not as high as GitHub's CDN, which made me think of how Linux distros release torrents as an alternative to sponsored mirrors. Also, I have done some analysis of large image datasets that come as torrents; the advantage is that it is possible to download subsets of files without downloading everything.
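
To illustrate the subset point, here is a minimal sketch of why a client can fetch only some files. It assumes the BitTorrent v1 layout, in which the files of a multi-file torrent are treated as one concatenated byte stream split into fixed-size pieces. The function name `pieces_for_files`, the file names, and the sizes are all hypothetical, and nothing here reflects how gensim-data is distributed.

```python
from typing import Iterable, List, Set, Tuple


def pieces_for_files(
    files: Iterable[Tuple[str, int]],
    piece_length: int,
    wanted: Set[str],
) -> List[int]:
    """Return the sorted piece indices that overlap the wanted files.

    `files` is an ordered list of (path, length_in_bytes) pairs, in the
    order they appear in the torrent (assumed v1 concatenated layout).
    """
    needed: Set[int] = set()
    offset = 0  # byte offset of the current file in the concatenated stream
    for path, length in files:
        if path in wanted and length > 0:
            first = offset // piece_length
            last = (offset + length - 1) // piece_length
            needed.update(range(first, last + 1))
        offset += length
    return sorted(needed)


# Hypothetical torrent contents: three files, 256-byte pieces.
example_files = [("a.jpg", 300), ("b.jpg", 500), ("c.jpg", 200)]
only_c = pieces_for_files(example_files, 256, {"c.jpg"})
only_a = pieces_for_files(example_files, 256, {"a.jpg"})
```

With these made-up sizes, fetching only `c.jpg` needs one of the four pieces. Pieces at file boundaries are shared, so a client may still download a few bytes of neighbouring files.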