sooftware / conformer

[Unofficial] PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)
Apache License 2.0
943 stars 174 forks source link

SentencePieceTrainer.Train error #23

Closed szalata closed 3 years ago

szalata commented 3 years ago
python prepare_libri.py --dataset_path ../../data/lasr/libri/LibriSpeech --vocab_size 5000
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=spm_input.txt --model_prefix=tokenizer --vocab_size=5000 --model_type=unigram --pad_id=0 --bos_id=1 --eos_id=2
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
  input: spm_input.txt
  input_format:
  model_prefix: tokenizer
  model_type: UNIGRAM
  vocab_size: 5000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars:
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}
denormalizer_spec {}
Traceback (most recent call last):
  File "prepare_libri.py", line 65, in <module>
    main()
  File "prepare_libri.py", line 58, in main
    prepare_tokenizer(transcripts_collection[0], opt.vocab_size)
  File "lasr/dataset/preprocess.py", line 71, in prepare_tokenizer
    spm.SentencePieceTrainer.Train(cmd)
  File "anaconda3/envs/lasr/lib/python3.7/site-packages/sentencepiece/__init__.py", line 407, in Train
    return SentencePieceTrainer._TrainFromString(arg)
  File "anaconda3/envs/lasr/lib/python3.7/site-packages/sentencepiece/__init__.py", line 385, in _TrainFromString
    return _sentencepiece.SentencePieceTrainer__TrainFromString(arg)
RuntimeError: Internal: /home/conda/feedstock_root/build_artifacts/sentencepiece_1612846348604/work/src/trainer_interface.cc(666) [insert_id(trainer_spec_.pad_id(), trainer_spec_.pad_piece())]
szalata commented 3 years ago

intended for another repo