[Open] vigneshmj1997 opened this issue 4 years ago
This is caused by the input_sentence_size, mining_sentence_size, training_sentence_size, and seed_sentencepiece_size parameters being set to Int32.MaxValue in SentencePieceWrapper.cs (in the factored-segmenter/src folder).
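If you want to confirm the offending lines before patching, this grep lists them (an illustrative check that just mirrors the sed pattern below and assumes the repo is cloned to ~/factored-segmenter):

# show the assignments that feed the four size parameters to spm_train
grep -n -E '"((input|mining|training)_sentence|seed_sentencepiece)_size"' ~/factored-segmenter/src/SentencePieceWrapper.cs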
I worked around it by commenting out those assignments with this sed command:
sed -E -i 's,^[[:space:]]*\["((input|mining|training)_sentence|seed_sentencepiece)_size"] = spmParams\.,//&,' SentencePieceWrapper.cs
and then rebuilding factored-segmenter.
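The rebuild step is roughly the following; this is only a sketch assuming the .NET Core 3.1 SDK and the linux-x64 target implied by the publish path in the training command further down, so adjust it to whatever the repo's README says:

# republish so the patched wrapper no longer passes the size flags to spm_train
cd ~/factored-segmenter/src
dotnet publish -c Release -r linux-x64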
As my data was not that huge, I had to lower the vocab size to just 8000.
time env LC_ALL=en_US.UTF-8 \
  ~/factored-segmenter/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter train \
  --model ~/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.fsm \
  --distinguish-initial-and-internal-pieces --single-letter-case-factors --serialize-indices-and-unrepresentables --inline-fixes \
  --min-piece-count 38 --min-char-count 2 --vocab-size 32000 \
  /data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled
What file should be substituted for "data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled"? I used a plain text file and got the following error:
Unhandled exception. System.IO.IOException: Exit code 1 was returned by external process: /usr/local/bin/spm_train --input /home/vignesh/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.spmtmp.data --model_prefix /home/vignesh/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.spmtmp --vocab_size 32000 --add_dummy_prefix false --normalization_rule_name identity --split_by_whitespace false --remove_extra_whitespaces false --input_sentence_size 2147483647 --mining_sentence_size 2147483647 --training_sentence_size 2147483647 --seed_sentencepiece_size 2147483647
   at Microsoft.MT.Common.Tokenization.ProcessTools.RunCommand(String exe, String args, String stdoutPath, String stderrPath, Boolean throwOnFailure, IEnumerable`1 envirVariables)
   at Microsoft.MT.Common.Tokenization.SentencePieceModel.SPMTrain(String inputPath, String modelPrefix, SentencePieceTrainConfig spmParams, String spmBinDir, Nullable`1 vocabSize)
   at Microsoft.MT.Common.Tokenization.SentencePieceModel.Train[Enumerable](Enumerable tokenStrings, String tempSPMModelPath, SentencePieceTrainConfig spmParams, Int32 minPieceCount, String spmBinDir)
   at Microsoft.MT.Common.Tokenization.FactoredSegmenterCoder.Train(FactoredSegmenterModelTrainConfig config, IEnumerable`1 input, IEnumerable`1 sourceSentenceAnnotations, String fsmModelPath, String spmBinDir)
   at factored_segmenter.Program.Main(String[] args)
Aborted (core dumped)

Can anyone help me figure out how to resolve this?
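For anyone hitting the same thing: the exception only surfaces the exit code, and given the workaround above, one plausible culprit is that the installed SentencePiece build no longer accepts some of those size flags. A quick, non-authoritative way to check which of them your spm_train still documents:

# list the size-related flags this spm_train build knows about
spm_train --help 2>&1 | grep -E '(input|mining|training)_sentence_size|seed_sentencepiece_size'

Also, since the RunCommand call in the trace takes stdoutPath/stderrPath arguments, there may be a captured stderr log alongside the .spmtmp files that contains the actual spm_train error message.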