Open h-vetinari opened 1 year ago
@taku910, did you have a moment to have a look at this issue already? Would there be a way to collaborate on this, e.g. starting with the abseil situation?
Gentle ping @taku910 - is there a way I could contribute e.g. some of the abseil changes (to truly build against an external abseil) in a way that it would be acceptable to you?
I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).
I've turned the patches against 0.1.99 into a tag, and rebased the branch of our patches against current master. With all the recent changes, the patch set has shrunken a lot, which is great! Thanks for your efforts to solve some of these tricky problems! 🙏
Of the remaining patches, the main thing that would be interesting to us to upstream in some form would be:
SPM_ENABLE_SHARED
should only build static libs, or there should be a similar SPM_DISABLE_STATIC
optionlibsentencepiece.so
Is this something that you would be interested in @taku910?
Long-term it would be great to also support builds against shared abseil & protobuf (resp. shared sentencepiece builds on windows).
Hi there - I help package many a package for conda-forge, which is a pretty big ecosystem of packages especially in data science and ML/AI. It started out mainly focussed on Python, but has come to include many other languages as well.
One strong point (especially in comparison to pip) is that we are able to sanely redistribute non-Python artefacts. We can track the relevant ABI of a package and rebuild its descendant packages where necessary. This allows us to (almost) exclusively distribute shared libraries, which lowers package footprints, and has several other benefits.
We've been shipping sentencepiece for about 3 years through the so-called feedstock, and packages can be installed like:
We have builds for linux-{64, aarch64, ppc64le}, osx-{64, arm64}, win-64, multiplied by CPython 3.8-3.11 as well as PyPy 3.8 & 3.9.
In the context of wanting to package torchtext (which depends on sentencepiece -- but really only the C++ bits) I wanted to split off a separate "output" / "package" called
libsentencepiece
that torchtext can build against, without having to spuriously rebuild it per python version (it would literally be the same in each case).Also, crucially, we absolutely need to rely on our own abseil & protobuf builds, which became a pretty big sticking point. I finally managed to pull off this split last year (there's references there to my very long-winded previous tries; eventually I gave up on trying to build a shared library on windows because it was getting too insane[^1]).
[^1]: by that point I was already patching the output generated by protobuf for some silly constexpr errors
However, at the time, I ran out of steam to raise an issue here, and discuss how - if at all - it would be possible to upstream some of these changes. That's all the more the case because some of the patches I came up with are definitely not ready for primetime here. I got quite exasperated with how the code in
thirdparty/absl
was neither third-party nor abseil proper[^3], so I ended up replacing that wholesale. Of course that doesn't mean I'm suggesting to do this here verbatim.[^3]: as force-replacing it with actual abseil lead to build errors because -- to the best of my understanding -- it contains sentencepiece-specific glue-code, but in the
absl
namespace.However, the abseil situation is perhaps one of the biggest sticking points. There's a variable called
SPM_USE_EXTERNAL_ABSL
, but it really doesn't do what it says on the tin, because all the include paths still point tothird_party/absl
, rather than abseil proper. This issue has also been encountered by other people: #869.Since I came up with my initial patches, sentencepiece is not forcing static linkage (
/MT
) of the MSVC runtime anymore (yay!). Other list of changes that were necessary for us:PROTOBUF_USE_DLLS
[^2])CMAKE_INSTALL_PREFIX
libsentencepiece
(in our case, this will always match the underlying version 1:1, though the ABI-guarantees - or lack thereof - of sentencepiece would be good to know)[^2]: this at least should be getting easier as of protobuf 23, because autotools is gone, and we have better metadata for CMake / pkgconfig.
I've rebased my patches (branch) against 0.1.99 and am currently re-building our redistribution recipe against both protobuf 3.21 & 4.23 (this also runs the test suite afterwards to catch regressions).
Would be happy if we could find a way that lets some of these improvements flow back upstream. 🙃 (I fully understand that some changes are specific to our needs, and it's fine if we keep carrying some patches; but any reduction helps, and some of the changes would also benefit other people I believe)
I'm happy to prepare dedicated PRs for individual changes, but wanted to discuss first which bits are interesting to the maintainers here, and how we could make them palatable for merging here (e.g. with switches based on symbols or CMake resp. environment variables).