piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1
965 stars 128 forks

Using downloaded resources shouldn't require internet access #23

Closed: piskvorky closed 4 years ago

piskvorky commented 6 years ago

As seen during our workshop yesterday, various network issues can appear during live or even offline events.

Once a user has downloaded a dataset onto their machine (~/gensim-data), they shouldn't need any internet access to use it. If the API needs to do some "online checking", this checking should be optional.

menshikh-iv commented 6 years ago

Let me clarify: if the user has already downloaded a model, the internet connection is used for

I agree about the check: it should be optional, but True by default. We must be sure that the data is correct, but the user should be able to disable this check at their own risk.
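A minimal sketch of what such an opt-out could look like, not the actual gensim code. The names `resolve_local`, `fetch_expected_md5` and `verify` are hypothetical, and the use of an MD5 checksum as the "online check" is an assumption made for illustration:

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_local(path, fetch_expected_md5, verify=True):
    """Return `path` if the file exists, optionally checking its checksum.

    `fetch_expected_md5` is a hypothetical callable that performs the
    online lookup; it is only called when `verify` is True.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} has not been downloaded")
    if verify:
        expected = fetch_expected_md5()  # the only step that needs network
        if file_md5(path) != expected:
            raise ValueError(f"checksum mismatch for {path}")
    return path
```

With `verify=False` the lookup callable is never invoked, so an already-downloaded file can be used with no connection at all.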

DSamuylov commented 6 years ago

I also encountered this problem. I was going on a trip where I would have no or only a very weak internet connection. I preloaded all the models before the trip, hoping to still work on my project. I got a big surprise when I realised I couldn't work without internet!! My Easter holidays are over before they have even started... I have to find what to do without my laptop :)

I agree that consistency is important, but a possible solution would be: 1) check whether there is an internet connection, 2) if 1 fails, try to load from the default location with some default model name, 3) if 2 fails, throw an exception saying that the model cannot be found. I am very new to this package, but I guess the default location shouldn't change for many users?

It would also be great to have some custom exceptions telling what went wrong. Otherwise it is not really obvious why it fails. If you need help, I could look into the source code and try to fix it when I am back.
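A rough sketch of the three steps and the custom exception suggested above, not the package's real implementation. `ModelNotFoundError`, `load_with_fallback` and the `download` callable are hypothetical names, and it is an assumption that a network failure surfaces as an `OSError`:

```python
from pathlib import Path


class ModelNotFoundError(Exception):
    """Raised when a model is neither downloadable nor cached locally."""


def load_with_fallback(name, download, base_dir=None):
    """Try `download(name)`; on a network error fall back to the local copy.

    `download` is a hypothetical callable that returns the path of the
    downloaded model and raises OSError when the network is unavailable.
    """
    if base_dir is None:
        base_dir = Path.home() / "gensim-data"  # default location from the issue
    try:
        return Path(download(name))  # step 1: the online path
    except OSError as err:
        local = Path(base_dir) / name  # step 2: look in the default location
        if local.exists():
            return local
        # step 3: report both failures in one message
        raise ModelNotFoundError(
            f"could not download {name!r} ({err}) and no local copy at {local}"
        ) from err
```

Chaining with `from err` keeps the original network error in the traceback, so the user can see both why the download failed and where the local copy was expected.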

menshikh-iv commented 6 years ago

@DSamuylov I agree, we definitely need to add a special flag for this case. Feel free to contribute (we need to add a "persistence" flag to https://github.com/RaRe-Technologies/gensim/blob/10a3dab8d00c0523ff871af75fb0badcff14848b/gensim/downloader.py#L357)

piskvorky commented 6 years ago

I agree with @DSamuylov. I didn't realize gensim-data depends on an internet connection; that's bad design. The way I see it, we need two things:

  1. Fix the design so that internet is not mandatory for already-downloaded models.

  2. Better, clearer progress/error messages, so users know what's going on. The errors we saw during the workshop were really terrible. Nobody knew what was going on.

mpenkov commented 4 years ago

Fixed via https://github.com/RaRe-Technologies/gensim/pull/2545