huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Allow hf_hub 0.18 #1383

Closed mariosasko closed 8 months ago

mariosasko commented 8 months ago

To address https://discuss.huggingface.co/t/exceeded-our-hourly-quotas-for-action-while-loading-dataset-to-hf-hub/60698/2

HuggingFaceDocBuilderDev commented 8 months ago

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker commented 8 months ago

cc @Narsil I'm down to allow <1.0, is that alright with you? (hf hub goes pretty fast from one version to the other since these are minor versions)

Narsil commented 8 months ago

As said internally, it's not up to us, but to @Wauplin.

Allowing <1.0 means no breaking change can happen between now and 1.0 (which I'm not sure is even planned). The goal of having these pins in place is so that hf-hub can break some stuff if it needs to.
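
For illustration (an editor's sketch, not part of this PR): the pins under discussion are pip-style specifier sets, and their effect can be checked with the `packaging` library that pip uses for version matching. The exact strict pin below is an assumption modelled on the "Allow hf_hub 0.18" title and the 0.16.4 lower bound mentioned later in this thread.

```python
# Minimal sketch (not from this PR) of how an upper-bound pin behaves,
# using the `packaging` library that pip relies on for version matching.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

strict_pin = SpecifierSet(">=0.16.4,<0.18")   # assumed pin that blocked huggingface_hub 0.18.0
relaxed_pin = SpecifierSet(">=0.16.4,<1.0")   # the "<1.0" option discussed above

for release in ("0.17.3", "0.18.0", "0.22.0"):
    v = Version(release)
    print(f"{release}: strict={v in strict_pin}, relaxed={v in relaxed_pin}")
# 0.17.3 satisfies both pins; 0.18.0 and 0.22.0 satisfy only the relaxed one.
```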

Wauplin commented 8 months ago

Thanks @Narsil for the ping. I have mixed feelings about "Allowing <1.0 means no breaking change can happen between now and 1.0". The usual policy in huggingface_hub is to deprecate for (at least) 3 minor releases before introducing a breaking change to the API. When that happens then yes, any library that has huggingface_hub<1.0 in its dependency requirements breaks. This is the case for the transformers/diffusers/datasets libraries, meaning that the policy is more or less "we don't care about breaking out-of-date versions as long as the last few versions of each lib work correctly with each other". This is not best practice, as we break previous versions of the libs from time to time, but it has worked so far.

That being said, I would prefer that we align the policy for tokenizers with the rest of the HF ecosystem to avoid users being blocked on earlier versions of huggingface_hub (as has been the case with 0.18.0). I can see two solutions here:

  1. Either we set a maximum version for huggingface_hub that is always kept at "latest version + 3 releases". In the current situation, that means huggingface_hub>=0.16.4,<0.22.0. We would align this policy across the other libraries in the HF ecosystem, and if we do that, we should have a GitHub Action to automatically open PRs to update the requirements. (A sketch of how this cap could be derived follows the list.)
  2. Or we accept breaking earlier versions of tokenizers as long as we respect the "3 minor versions" deprecation cycle in huggingface_hub, i.e. we set huggingface_hub>=0.16.4,<1.0.0. This is already the case in the rest of the HF ecosystem, which allows us to iterate faster while being less robust.
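
For concreteness (an editor's sketch, not from the thread): option 1's moving cap could be derived mechanically from the latest published minor, assuming "latest version + 3 releases" means an exclusive ceiling four minors above the current one (0.18 giving <0.22.0, matching the example above) and that huggingface_hub stays on major version 0. The `hub_requirement` helper below is hypothetical.

```python
# Hypothetical helper showing how option 1's ceiling could be computed, assuming
# "latest version + 3 releases" means an exclusive cap four minors above the
# currently released minor and that huggingface_hub stays on major version 0.
from packaging.version import Version

def hub_requirement(latest: str, lower: str = "0.16.4", headroom: int = 4) -> str:
    """Build a pip-style requirement string for huggingface_hub."""
    minor_cap = Version(latest).minor + headroom
    return f"huggingface_hub>={lower},<0.{minor_cap}.0"

print(hub_requirement("0.18.0"))  # -> huggingface_hub>=0.16.4,<0.22.0
# Option 2 is simply the static string "huggingface_hub>=0.16.4,<1.0.0".
```

A GitHub Action following option 1 would then only need to regenerate this string and open a PR whenever a new huggingface_hub minor is released.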

In both cases, what I don't want (as you mentioned) is for huggingface_hub to be blocked from updating things when needed. The API is far more stable than a year ago, but I'd prefer not to have it set in stone.

Please let me know what you think. (cc @LysandreJik as well for other OS libraries)