google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0
10.28k stars 1.18k forks source link

Patches carried by conda-forge for packaging sentencepiece #876

Open h-vetinari opened 1 year ago

h-vetinari commented 1 year ago

Hi there - I help package many a package for conda-forge, which is a pretty big ecosystem of packages especially in data science and ML/AI. It started out mainly focussed on Python, but has come to include many other languages as well.

One strong point (especially in comparison to pip) is that we are able to sanely redistribute non-Python artefacts. We can track the relevant ABI of a package and rebuild its descendant packages where necessary. This allows us to (almost) exclusively distribute shared libraries, which lowers package footprints, and has several other benefits.

We've been shipping sentencepiece for about 3 years through the so-called feedstock, and packages can be installed like:

conda install -c conda-forge sentencepiece  # or mamba instead of conda

We have builds for linux-{64, aarch64, ppc64le}, osx-{64, arm64}, win-64, multiplied by CPython 3.8-3.11 as well as PyPy 3.8 & 3.9.

In the context of wanting to package torchtext (which depends on sentencepiece -- but really only the C++ bits) I wanted to split off a separate "output" / "package" called libsentencepiece that torchtext can build against, without having to spuriously rebuild it per python version (it would literally be the same in each case).

Also, crucially, we absolutely need to rely on our own abseil & protobuf builds, which became a pretty big sticking point. I finally managed to pull off this split last year (there's references there to my very long-winded previous tries; eventually I gave up on trying to build a shared library on windows because it was getting too insane[^1]).

[^1]: by that point I was already patching the output generated by protobuf for some silly constexpr errors

However, at the time, I ran out of steam to raise an issue here, and discuss how - if at all - it would be possible to upstream some of these changes. That's all the more the case because some of the patches I came up with are definitely not ready for primetime here. I got quite exasperated with how the code in thirdparty/absl was neither third-party nor abseil proper[^3], so I ended up replacing that wholesale. Of course that doesn't mean I'm suggesting to do this here verbatim.

[^3]: as force-replacing it with actual abseil lead to build errors because -- to the best of my understanding -- it contains sentencepiece-specific glue-code, but in the absl namespace.

However, the abseil situation is perhaps one of the biggest sticking points. There's a variable called SPM_USE_EXTERNAL_ABSL, but it really doesn't do what it says on the tin, because all the include paths still point to third_party/absl, rather than abseil proper. This issue has also been encountered by other people: #869.

Since I came up with my initial patches, sentencepiece is not forcing static linkage (/MT) of the MSVC runtime anymore (yay!). Other list of changes that were necessary for us:

[^2]: this at least should be getting easier as of protobuf 23, because autotools is gone, and we have better metadata for CMake / pkgconfig.

I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).

Would be happy if we could find a way that lets some of these improvements flow back upstream. 🙃 (I fully understand that some changes are specific to our needs, and it's fine if we keep carrying some patches; but any reduction helps, and some of the changes would also benefit other people I believe)

I'm happy to prepare dedicated PRs for individual changes, but wanted to discuss first which bits are interesting to the maintainers here, and how we could make them palatable for merging here (e.g. with switches based on symbols or CMake resp. environment variables).

h-vetinari commented 1 year ago

@taku910, did you have a moment to have a look at this issue already? Would there be a way to collaborate on this, e.g. starting with the abseil situation?

h-vetinari commented 1 year ago

Gentle ping @taku910 - is there a way I could contribute e.g. some of the abseil changes (to truly build against an external abseil) in a way that it would be acceptable to you?

h-vetinari commented 8 months ago

I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).

I've turned the patches against 0.1.99 into a tag, and rebased the branch of our patches against current master. With all the recent changes, the patch set has shrunken a lot, which is great! Thanks for your efforts to solve some of these tricky problems! 🙏

Of the remaining patches, the main thing that would be interesting to us to upstream in some form would be:

Is this something that you would be interested in @taku910?

Long-term it would be great to also support builds against shared abseil & protobuf (resp. shared sentencepiece builds on windows).