Open AdityaYadavalli1 opened 3 years ago
Would you be able to provide the .arpa file you've used?
Yes. I just renamed lm.arpa to lm.txt while attaching it because GitHub wouldn't allow me to attach .arpa files. lm.txt
I have also tried to train the LM using this code. However, I got the following error.
created Count File
one of required modified KneserNey count-of-counts is zero
error in discount estimator for order 1
created Language Model
Traceback (most recent call last):
  File "spell-checker.py", line 1994, in <module>
    main()
  File "spell-checker.py", line 1978, in main
    LM, EM, correction_input = process_arguments(args)
  File "spell-checker.py", line 1752, in process_arguments
    LM, unigrams, bigrams = buildLanguageModel(files=file_container)
  File "spell-checker.py", line 582, in buildLanguageModel
    LM = LanguageModel(os.path.join(Path(DATA_DIR, TARGET_LANGUAGE_MODEL)))
  File "spell-checker.py", line 215, in __init__
    arpa = open(arpa_file, "r", encoding="iso-8859-1")
FileNotFoundError: [Errno 2] No such file or directory: 'data/LM2.arpa'
This is the command I used for that:
python spell-checker.py --order 3 --train data/corpus.txt -lm LM2.arpa
Therefore I trained the LM externally.
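The log above suggests that the script printed "created Language Model" even though SRILM had reported a discount-estimator error, and that the missing file only surfaced later as a FileNotFoundError. A minimal sketch of a guard that would catch this earlier, assuming the training step is launched through subprocess, is shown below. The helper name run_and_check is hypothetical and not part of spell-checker.py; whether ngram-count returns a non-zero exit code in this situation is also an assumption, which is why the sketch checks for the output file as well.

```python
import subprocess
from pathlib import Path


def run_and_check(cmd, expected_output):
    """Run an external command and verify that it produced the expected file.

    `cmd` is a list of arguments and `expected_output` is the file the
    command is supposed to write (for example the .arpa file). This helper
    is hypothetical; it is not taken from spell-checker.py.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    out_path = Path(expected_output)
    # Fail loudly if the tool reported an error or silently wrote nothing.
    if result.returncode != 0 or not out_path.is_file():
        raise RuntimeError(
            f"command failed (exit code {result.returncode}); "
            f"{out_path} was not created.\nstderr:\n{result.stderr}"
        )
    return out_path
```

With a guard like this, the failure would be reported at the training step, together with whatever SRILM wrote to stderr, rather than at the later open() call.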
Before this, I ran into another issue caused by concatenating a str with a PosixPath, so I edited line 566 to
subprocess.call(SRILM_PATH + "/ngram-count -vocab "+str(DATA_DIR)+"/vocabulary.count -order " + str(
N_GRAM) + " -no-eos -no-sos -text "+ str(DATA_DIR)+"/corpus.txt -unk -write "+str(DATA_DIR)+"/count" + str(N_GRAM) + ".count", shell=True)
and line 577 to
subprocess.call(SRILM_PATH + "/ngram-count -vocab "+str(DATA_DIR)+"/vocabulary.count -order " + str(
N_GRAM) + " -unk -no-eos -no-sos -read "+str(DATA_DIR)+"/count" + str(N_GRAM) + ".count -lm "+str(DATA_DIR)+"/" + TARGET_LANGUAGE_MODEL +" " + smooth, shell=True)
These are minor modifications, and I don't expect them to affect anything else.
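An alternative to wrapping every path in str() inside a long concatenated string is to build the command as a list of arguments. The sketch below reproduces the flags from the edited line 566 only; the function name build_count_command is hypothetical, and the parameters stand in for the script's SRILM_PATH, DATA_DIR and N_GRAM.

```python
from pathlib import Path


def build_count_command(srilm_path, data_dir, order):
    """Return the ngram-count call from the edited line 566 as an argument list.

    Hypothetical helper: the flags are copied from the command above, and
    pathlib joins the paths so no str + PosixPath concatenation is needed.
    """
    data_dir = Path(data_dir)
    return [
        str(Path(srilm_path) / "ngram-count"),
        "-vocab", str(data_dir / "vocabulary.count"),
        "-order", str(order),
        "-no-eos", "-no-sos",
        "-text", str(data_dir / "corpus.txt"),
        "-unk",
        "-write", str(data_dir / f"count{order}.count"),
    ]
```

The resulting list can be passed to subprocess.call without shell=True, which also avoids quoting problems if a path contains spaces.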
Hi,
I already have Hindi trigram LMs built using SRILM (.arpa files). I would like to use them to correct spellings in another Hindi text file. However, I get the following error when trying to do that.
Following is the command I used:
python spell-checker.py --order 3 --correct data/corpus.txt -lm data/lm.arpa
corpus.txt is the file that contains the incorrect spellings, and I want the lm.arpa LM to fix them.
This seems like a UTF-8 encoding issue. Is there any way to circumvent this issue?
Thanks in advance
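One possible cause, going by the traceback earlier in this thread, is that spell-checker.py opens the .arpa file with encoding="iso-8859-1". The error message itself is not shown above, so this is only a guess. The sketch below illustrates what happens when UTF-8 Devanagari text is decoded as ISO-8859-1; the helper read_text_lines and the sample line are made up for the demonstration and are not taken from the script or from lm.arpa.

```python
import tempfile
from pathlib import Path


def read_text_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


# A made-up ARPA-style line: log-probability, word, back-off weight.
sample = "-1.5\tनमस्ते\t-0.3"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.arpa"
    demo_path.write_text(sample + "\n", encoding="utf-8")
    as_utf8 = read_text_lines(demo_path, encoding="utf-8")
    as_latin1 = read_text_lines(demo_path, encoding="iso-8859-1")

# Decoding as UTF-8 round-trips the Hindi word; ISO-8859-1 does not raise
# an error but turns each UTF-8 byte into a separate, unrelated character.
print(as_utf8[0] == sample)
print(as_latin1[0] == sample)
```

If this is indeed the cause, the words loaded from the LM would never match the words in corpus.txt, and changing the encoding argument at the open() call would be the place to start.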