microsoft / factored-segmenter

Unsupervised factor-based text tokenizer for natural-language processing applications

Not able to train the model #3

Open vigneshmj1997 opened 4 years ago

vigneshmj1997 commented 4 years ago

```
time env LC_ALL=en_US.UTF-8 \
  ~/factored-segmenter/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter train \
  --model ~/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.fsm \
  --distinguish-initial-and-internal-pieces --single-letter-case-factors --serialize-indices-and-unrepresentables --inline-fixes \
  --min-piece-count 38 --min-char-count 2 --vocab-size 32000 \
  /data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled
```

What file should be passed in place of "/data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled"? I used a text file and got the following error:

```
Unhandled exception. System.IO.IOException: Exit code 1 was returned by external process:
/usr/local/bin/spm_train --input /home/vignesh/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.spmtmp.data
  --model_prefix /home/vignesh/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.spmtmp
  --vocab_size 32000 --add_dummy_prefix false --normalization_rule_name identity
  --split_by_whitespace false --remove_extra_whitespaces false
  --input_sentence_size 2147483647 --mining_sentence_size 2147483647
  --training_sentence_size 2147483647 --seed_sentencepiece_size 2147483647
   at Microsoft.MT.Common.Tokenization.ProcessTools.RunCommand(String exe, String args, String stdoutPath, String stderrPath, Boolean throwOnFailure, IEnumerable`1 envirVariables)
   at Microsoft.MT.Common.Tokenization.SentencePieceModel.SPMTrain(String inputPath, String modelPrefix, SentencePieceTrainConfig spmParams, String spmBinDir, Nullable`1 vocabSize)
   at Microsoft.MT.Common.Tokenization.SentencePieceModel.Train[Enumerable](Enumerable tokenStrings, String tempSPMModelPath, SentencePieceTrainConfig spmParams, Int32 minPieceCount, String spmBinDir)
   at Microsoft.MT.Common.Tokenization.FactoredSegmenterCoder.Train(FactoredSegmenterModelTrainConfig config, IEnumerable`1 input, IEnumerable`1 sourceSentenceAnnotations, String fsmModelPath, String spmBinDir)
   at factored_segmenter.Program.Main(String[] args)
Aborted (core dumped)
```

Can anyone help me figure out how to resolve this?
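One way to narrow this down is to rerun the `spm_train` command from the exception message by hand and look at its stderr. The sketch below copies the flags from the log above; the `--model_prefix` output path and the note about removed flags are assumptions about the locally installed SentencePiece build, not something confirmed in this thread.

```bash
# Rerun the spm_train call from the exception message and capture stderr.
# If the installed SentencePiece no longer recognizes flags such as
# --mining_sentence_size or --training_sentence_size (assumed removed in
# newer releases), the failure will show up here as an unknown-flag error.
/usr/local/bin/spm_train \
  --input /home/vignesh/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.spmtmp.data \
  --model_prefix /tmp/spm_debug \
  --vocab_size 32000 --add_dummy_prefix false --normalization_rule_name identity \
  --split_by_whitespace false --remove_extra_whitespaces false \
  --input_sentence_size 2147483647 --mining_sentence_size 2147483647 \
  --training_sentence_size 2147483647 --seed_sentencepiece_size 2147483647 \
  2> /tmp/spm_debug.stderr
cat /tmp/spm_debug.stderr
```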

stribizhev commented 3 years ago

This is due to the input_sentence_size, mining_sentence_size, training_sentence_size, and seed_sentencepiece_size parameters being set to Int32.MaxValue in SentencePieceWrapper.cs (in the factored-segmenter/src folder).

I worked around it by modifying the file with this sed command:

```bash
sed -E -i 's,^[[:space:]]*\["((input|mining|training)_sentence|seed_sentencepiece)_size"] = spmParams\.,//&,' SentencePieceWrapper.cs
```
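To see which assignments the sed command touches before (or after) running it, the same pattern can be reused with grep; this assumes you are in the factored-segmenter/src directory where SentencePieceWrapper.cs lives.

```bash
# List the lines that forward the four *_size parameters to spm_train;
# these are the ones the sed command above comments out.
grep -nE '\["((input|mining|training)_sentence|seed_sentencepiece)_size"] = spmParams\.' SentencePieceWrapper.cs
```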

and then rebuilding factored-segmenter.
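For reference, a minimal sketch of the rebuild step, assuming the layout implied by the binary path in the original command (project under src/, published for linux-x64 on .NET Core 3.1); the exact project file and publish options may differ in your checkout.

```bash
# Republish after editing SentencePieceWrapper.cs; the output lands under
# src/bin/Release/netcoreapp3.1/linux-x64/publish/, matching the path used
# in the train command above.
cd ~/factored-segmenter/src
dotnet publish -c Release -r linux-x64 -f netcoreapp3.1
```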

As my data was not that huge, I had to lower the vocab size to just 8000.
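For completeness, this is the original train invocation with only the --vocab-size value changed as described here; all other flags and paths are unchanged from the command at the top of the issue.

```bash
time env LC_ALL=en_US.UTF-8 \
  ~/factored-segmenter/src/bin/Release/netcoreapp3.1/linux-x64/publish/factored-segmenter train \
  --model ~/factored-segmenter/out/enu.deu.generalnn.joint.segmenter.fsm \
  --distinguish-initial-and-internal-pieces --single-letter-case-factors --serialize-indices-and-unrepresentables --inline-fixes \
  --min-piece-count 38 --min-char-count 2 --vocab-size 8000 \
  /data1/SpeechTrans/ENU-DEU_Student.speech/train_segmenter.ENU.DEU.generalnn.joint/corpus.sampled
```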