pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Consider pinning your spaCy version in requirements.txt? #178

Open honnibal opened 7 years ago

honnibal commented 7 years ago

I just noticed that your requirements.txt doesn't pin to any particular version of spaCy or NLTK.

We've recently pushed spaCy 2, and while we've endeavoured to keep breaking changes to a minimum, it's a pretty big release: https://github.com/explosion/spaCy/releases/tag/v2.0.2

Even if the API doesn't change, there's the potential for problematic train/test skew for you if we make bug fixes to the tokenization, especially for languages other than English. Our compatibility policy is that changes that can affect statistical models can be made on minor releases --- e.g. spaCy 2.1.0 might fix some bug in the Hungarian tokenizer that affects a large number of tokens for that language. This means that sometimes, models trained with one minor version will suffer decreased accuracy if another version of the library is used at runtime.
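To make the train/test skew concern concrete, here is a minimal sketch of a guard that compares the library version a model was trained with against the one installed at runtime. The helper names (`minor_series`, `same_minor_series`, `installed_version`) are hypothetical and not part of spaCy or torchtext, and the parsing assumes plain numeric `major.minor` components.

```python
from importlib.metadata import PackageNotFoundError, version


def minor_series(ver):
    """Return (major, minor) as ints from a version string such as '2.0.2'.

    Assumes the first two dot-separated components are plain integers.
    """
    major, minor = ver.split(".")[:2]
    return int(major), int(minor)


def same_minor_series(trained_with, installed):
    """True when both versions share the same major.minor series."""
    return minor_series(trained_with) == minor_series(installed)


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

For example, `same_minor_series("2.0.2", "2.1.0")` is False, which is the case described above where a minor release may change tokenization. A caller could record `installed_version("spacy")` alongside a trained model and warn when the series differs at load time.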

There are also potential performance considerations. There's currently an open ticket about performance degradation of the tokenizer. It's unfortunate that this problem made it into the release, and we're working on it. But in the meantime, users who make a new installation of torchtext might find their preprocessing is much slower.

jekbradbury commented 6 years ago

Our policy so far has been to treat spaCy and NLTK as optional dependencies and use whatever version the user is already working with or already has installed. Choosing the "spacy" tokenizer option is a convenience that saves manually creating a lambda that calls spaCy's English tokenizer. But that's not actually incompatible with providing a version in requirements.txt: the optional dependencies listed there aren't installed or checked by pip install torchtext, so we'll go ahead and pin.
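For readers unfamiliar with the manual route, the lambda that the "spacy" option stands in for might look roughly like the sketch below. This is an illustration and not torchtext's actual code: `make_spacy_tokenizer` and `load_english_tokenizer` are hypothetical names, and `spacy.blank("en")` is assumed as the way to get an English tokenizer without downloading a statistical model.

```python
def make_spacy_tokenizer(nlp):
    """Wrap a spaCy-style pipeline as a str -> list[str] tokenizer.

    `nlp` is expected to expose `nlp.tokenizer(text)`, yielding tokens
    that each have a `.text` attribute.
    """
    return lambda text: [tok.text for tok in nlp.tokenizer(text)]


def load_english_tokenizer():
    """Build the tokenizer from whatever spaCy version is installed."""
    import spacy  # optional dependency, imported only when requested

    return make_spacy_tokenizer(spacy.blank("en"))
```

Importing spaCy inside the function keeps it optional: nothing fails at import time for users who never ask for this tokenizer, and whichever version they have installed is the one that gets used.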