google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Core dump when built with CXXFLAGS `-Wp,-D_GLIBCXX_ASSERTIONS` #987

Open Henry-ZHR opened 7 months ago


Current commit: 4d6a1f41069c4636c51a5590f7578a0dbed83450

Running the following in a clean ubuntu:latest Docker container:

apt update
apt install -y cmake build-essential pkg-config libgoogle-perftools-dev git gdb
cd /tmp
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-Wp,-D_GLIBCXX_ASSERTIONS"
cmake --build build --parallel $(nproc)
cmake --install build
ldconfig -v
gdb --batch -ex run -ex bt --args spm_train --input=data/botchan.txt --model_prefix=test_tmp/m --vocab_size=1000

produces the following output:

/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = unsigned int; _Alloc = std::allocator<unsigned int>; std::vector<_Tp, _Alloc>::reference = unsigned int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.
Thread 1 "spm_train" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=134030593879168) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=134030593879168) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=134030593879168) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=134030593879168, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000079e66e029476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000079e66e00f7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00005e95735eae78 in std::__replacement_assert (__file=<optimized out>, __line=<optimized out>, __function=<optimized out>, __condition=<optimized out>) at /usr/include/x86_64-linux-gnu/c++/11/bits/c++config.h:514
#6  0x000079e66e7b46e9 in std::vector<unsigned int, std::allocator<unsigned int> >::operator[] (this=<optimized out>, __n=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:1045
#7  std::vector<unsigned int, std::allocator<unsigned int> >::operator[] (this=0x7fff6827e780, this=0x7fff6827e780, __n=<optimized out>) at /usr/include/c++/11/bits/stl_vector.h:1043
#8  sentencepiece::unigram::Trainer::MakeSeedSentencePiecesInternal<int> (this=0x5e957589c000) at /tmp/sentencepiece/src/unigram_model_trainer.cc:280
#9  0x000079e66e79f17d in sentencepiece::unigram::Trainer::MakeSeedSentencePieces[abi:cxx11]() (this=this@entry=0x5e957589c000) at /tmp/sentencepiece/src/unigram_model_trainer.cc:142
#10 0x000079e66e7a2486 in sentencepiece::unigram::Trainer::Train (this=0x5e957589c000) at /tmp/sentencepiece/src/unigram_model_trainer.cc:595
#11 0x000079e66e7ca4d2 in sentencepiece::TrainerInterface::Train (output_model_proto=0x0, sentence_iterator=0x0, this=<optimized out>) at /tmp/sentencepiece/src/trainer_interface.h:97
#12 sentencepiece::SentencePieceTrainer::Train (trainer_spec=..., normalizer_spec=..., denormalizer_spec=..., sentence_iterator=sentence_iterator@entry=0x0, serialized_model_proto=serialized_model_proto@entry=0x0) at /tmp/sentencepiece/src/sentencepiece_trainer.cc:85
#13 0x00005e95735ea567 in main (argc=<optimized out>, argv=<optimized out>) at /tmp/sentencepiece/src/spm_train_main.cc:282

https://github.com/google/sentencepiece/blob/4d6a1f41069c4636c51a5590f7578a0dbed83450/src/unigram_model_trainer.cc#L280

This appears to be the same location as #966.
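
For context, here is a minimal sketch of the kind of access that `_GLIBCXX_ASSERTIONS` rejects in `std::vector::operator[]`. It assumes (this is not confirmed against line 280 of unigram_model_trainer.cc) that the failing code indexes a `std::vector<unsigned int>` with an index equal to its size, for example to form a one-past-the-end pointer. `SumRange` and `SumFirstN` are hypothetical names used only for illustration and are not part of sentencepiece.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from sentencepiece): sums the elements in [begin, end).
unsigned int SumRange(const unsigned int* begin, const unsigned int* end) {
  unsigned int total = 0;
  for (const unsigned int* p = begin; p != end; ++p) total += *p;
  return total;
}

// Hypothetical example of the safe pattern.
//
// The tempting form
//     const unsigned int* end = &v[n];
// calls operator[] with n == v.size() when the whole vector is wanted.
// That is out of range even though the element is never read, and
// libstdc++ aborts with "Assertion '__n < this->size()' failed" when
// _GLIBCXX_ASSERTIONS is defined.
//
// Pointer arithmetic on data() yields the same address without ever
// calling operator[] out of range.
unsigned int SumFirstN(const std::vector<unsigned int>& v, std::size_t n) {
  assert(n <= v.size());  // one-past-the-end is allowed, anything beyond is not
  const unsigned int* begin = v.data();
  return SumRange(begin, begin + n);
}
```

If the code at the linked line follows this pattern, replacing the out-of-range `operator[]` call with `data()` plus an offset (or with iterators) should keep the behavior unchanged while satisfying the assertion.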