mlc-ai / tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece
Apache License 2.0

Problem resolving some symbols when using the library in an Android C++ project (I am compiling using ndk) #31

Open cs-jlopezr opened 2 months ago

cs-jlopezr commented 2 months ago

I was able to compile the library successfully, but when I use it as shown in the example folder, I get the following linker errors:

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::SentencePieceProcessor()
referenced by sentencepiece_tokenizer.cc:18 (src/sentencepiece_tokenizer.cc:18) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::SentencePieceTokenizer(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::LoadFromSerializedProto(std::__ndk1::basic_string_view<char, std::__ndk1::char_traits<char> >)
referenced by sentencepiece_tokenizer.cc:19 (src/sentencepiece_tokenizer.cc:19) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::SentencePieceTokenizer(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::util::Status::~Status()
referenced by sentencepiece_tokenizer.cc:19 (src/sentencepiece_tokenizer.cc:19) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::SentencePieceTokenizer(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a
referenced by sentencepiece_tokenizer.cc:24 (src/sentencepiece_tokenizer.cc:24) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Encode(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a
referenced by sentencepiece_tokenizer.cc:24 (src/sentencepiece_tokenizer.cc:24) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Encode(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a
referenced 2 more times

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::~SentencePieceProcessor()
referenced by sentencepiece_tokenizer.cc:20 (src/sentencepiece_tokenizer.cc:20) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::SentencePieceTokenizer(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a
referenced by sentencepiece_tokenizer.cc:16 (src/sentencepiece_tokenizer.cc:16) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::~SentencePieceTokenizer()) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::Encode(std::__ndk1::basic_string_view<char, std::__ndk1::char_traits<char> >, std::__ndk1::vector<int, std::__ndk1::allocator<int> >*) const
referenced by sentencepiece_tokenizer.cc:24 (src/sentencepiece_tokenizer.cc:24) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Encode(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::util::Status::IgnoreError()
referenced by sentencepiece_tokenizer.cc:24 (src/sentencepiece_tokenizer.cc:24) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Encode(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a
referenced by sentencepiece_tokenizer.cc:30 (src/sentencepiece_tokenizer.cc:30) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Decode(std::__ndk1::vector<int, std::__ndk1::allocator<int> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::Decode(std::__ndk1::vector<int, std::__ndk1::allocator<int> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >*) const
referenced by sentencepiece_tokenizer.cc:30 (src/sentencepiece_tokenizer.cc:30) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::Decode(std::__ndk1::vector<int, std::__ndk1::allocator<int> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::GetPieceSize() const
referenced by sentencepiece_tokenizer.cc:35 (src/sentencepiece_tokenizer.cc:35) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::GetVocabSize()) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::IdToPiece(int) const
referenced by sentencepiece_tokenizer.cc:40 (src/sentencepiece_tokenizer.cc:40) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::IdToToken(int)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

ld: error: undefined symbol: sentencepiece::SentencePieceProcessor::PieceToId(std::__ndk1::basic_string_view<char, std::__ndk1::char_traits<char> >) const
referenced by sentencepiece_tokenizer.cc:42 (src/sentencepiece_tokenizer.cc:42) sentencepiece_tokenizer.cc.o:(tokenizers::SentencePieceTokenizer::TokenToId(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)) in archive ./src/tokenizers-cpp/libtokenizers_cpp.a

clang++: error: linker command failed with exit code 1 (use -v to see invocation)

When I inspect the library, the symbols appear to be properly defined.

In my code I am doing exactly what the example folder does, so I am not calling the unresolved symbols directly. The ones I do use (FromBlobSentencePiece, for example) resolve correctly. What could be causing this?

One thing that puzzles me: why is the linker complaining about the src/sentencepiece_tokenizer.cc file when I am only using the static library (the .a file) through the tokenizers_cpp.h header provided by the library?

cs-jlopezr commented 2 months ago

I was able to solve the issue by compiling the sentencepiece tokenizer library separately and adding it as an explicit link dependency. This is not clear from the usage instructions.

cs-jlopezr commented 2 months ago

And now I am getting a segmentation fault when using the library, and I am not sure why. I am doing the same as in the example. The tokenizer initialization apparently succeeds, but as soon as I try to encode, it segfaults.
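
One input that is cheap to rule out here is the model blob itself. The linker log earlier in the thread shows sentencepiece::util::Status::~Status() referenced from the SentencePieceTokenizer constructor at the LoadFromSerializedProto call, which may mean a failed load is not reported to the caller, so initialization could look fine even with a bad blob. The sketch below is only an assumption about one possible cause, not a diagnosis: LoadModelBlob is a hypothetical helper (it is not part of tokenizers-cpp) that reads the model file in binary mode and fails loudly if the file is missing or empty.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of tokenizers-cpp: read a whole file into a
// std::string in binary mode and throw if it cannot be opened or is empty.
std::string LoadModelBlob(const std::string& path) {
  std::ifstream in(path, std::ios::in | std::ios::binary);
  if (!in) {
    throw std::runtime_error("cannot open model file: " + path);
  }
  // Read every byte, including embedded NULs, into the string.
  std::string blob((std::istreambuf_iterator<char>(in)),
                   std::istreambuf_iterator<char>());
  if (blob.empty()) {
    throw std::runtime_error("model file is empty: " + path);
  }
  return blob;
}

// Intended use (assumes tokenizers_cpp.h is included and the library is
// linked; the path is a placeholder):
//   std::string blob = LoadModelBlob("/path/to/tokenizer.model");
//   auto tok = tokenizers::Tokenizer::FromBlobSentencePiece(blob);
//   if (tok) { auto ids = tok->Encode("hello"); }
```

On Android in particular, the working directory and file permissions of the process often differ from a desktop run, so a path that works in the example may not resolve on the device. If the blob loads correctly and the crash persists, the cause lies elsewhere.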