Closed: szalata closed this issue 3 years ago
Running the LibriSpeech preparation script fails during SentencePiece tokenizer training:

```
python prepare_libri.py --dataset_path ../../data/lasr/libri/LibriSpeech --vocab_size 5000
```

```
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=spm_input.txt --model_prefix=tokenizer --vocab_size=5000 --model_type=unigram --pad_id=0 --bos_id=1 --eos_id=2
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
  input: spm_input.txt
  input_format:
  model_prefix: tokenizer
  model_type: UNIGRAM
  vocab_size: 5000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface: ⁇
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
```

```
Traceback (most recent call last):
  File "prepare_libri.py", line 65, in <module>
    main()
  File "prepare_libri.py", line 58, in main
    prepare_tokenizer(transcripts_collection[0], opt.vocab_size)
  File "lasr/dataset/preprocess.py", line 71, in prepare_tokenizer
    spm.SentencePieceTrainer.Train(cmd)
  File "anaconda3/envs/lasr/lib/python3.7/site-packages/sentencepiece/__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "anaconda3/envs/lasr/lib/python3.7/site-packages/sentencepiece/__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
RuntimeError: Internal: /home/conda/feedstock_root/build_artifacts/sentencepiece_1612846348604/work/src/trainer_interface.cc(666) [insert_id(trainer_spec_.pad_id(), trainer_spec_.pad_piece())]
```
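A likely cause, judging from the log rather than anything confirmed in this thread: the script passes `--pad_id=0` while `unk_id` keeps its SentencePiece default of 0 (the dumped `trainer_spec` shows both `unk_id: 0` and `pad_id: 0`), and the trainer's `insert_id` check rejects two special tokens sharing an id. A minimal sketch of a command string with non-colliding ids; `build_spm_cmd` is a hypothetical helper for illustration, not part of the lasr repo:

```python
# Hypothetical helper: build a SentencePiece training command where every
# special token gets a distinct id. The flag names match the log above;
# moving <pad> to id 3 (an assumption, any unused id works) avoids the
# unk_id/pad_id collision that triggers the insert_id RuntimeError.
def build_spm_cmd(input_path: str, model_prefix: str, vocab_size: int) -> str:
    return (
        f"--input={input_path} "
        f"--model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} "
        "--model_type=unigram "
        "--unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3"
    )

cmd = build_spm_cmd("spm_input.txt", "tokenizer", 5000)
# spm.SentencePieceTrainer.Train(cmd)  # would now pass the distinct-id check
print(cmd)
```

Alternatively, dropping `--pad_id=0` entirely (SentencePiece's default is `pad_id=-1`, i.e. no pad token) sidesteps the collision without renumbering anything.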
Closing: this issue was intended for another repo.