uds-lsv / Noisy-Channel-Spell-Checker

A tool for correcting misspellings in textual input using the Noisy Channel Model.
Apache License 2.0

Spell Check using LM built externally #1

Open AdityaYadavalli1 opened 3 years ago

AdityaYadavalli1 commented 3 years ago

Hi,

I already have Hindi trigram LMs built with SRILM (.arpa files). I would like to use them to correct spellings in another Hindi text file. However, I get the following error when I try to do that.

Traceback (most recent call last):
  File "spell-checker.py", line 1994, in <module>
    main()
  File "spell-checker.py", line 1978, in main
    LM, EM, correction_input = process_arguments(args)
  File "spell-checker.py", line 1730, in process_arguments
    LM, unigrams, bigrams = buildLanguageModel(arpa_file=os.path.join(Path(args.languagemodel)))
  File "spell-checker.py", line 422, in buildLanguageModel
    LM = LanguageModel(arpa_file)
  File "spell-checker.py", line 245, in __init__
    self[ " ".join(line[1:order_counter+1])] = (float(line[0]), float(line[order_counter+1]))
ValueError: could not convert string to float: 'à¤\x81धà¥\x87रा'

Following is the command I used:

python spell-checker.py --order 3 --correct data/corpus.txt -lm data/lm.arpa

corpus.txt is the file that contains the incorrect spellings, and I want the lm.arpa LM to fix them.

This seems like a UTF-8 encoding issue. Is there any way to circumvent it?
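For reference, the garbled token in the traceback looks exactly like UTF-8 Devanagari that has been decoded as Latin-1. A quick check in Python (the word below is only an example, not necessarily the token from my LM):

```python
# Example only: take any Devanagari word, encode it as UTF-8, and decode the
# raw bytes as ISO-8859-1 (Latin-1). The result is the familiar "à¤..." mojibake.
word = "अँधेरा"
garbled = word.encode("utf-8").decode("iso-8859-1")
print(repr(garbled))  # something like 'à¤\x85à¤\x81à¤§à¥\x87à¤°à¤¾'
```

So the .arpa file itself should be fine; it just appears to be read with the wrong codec somewhere.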

Thanks in advance

kleiba commented 3 years ago

Would you be able to provide the .arpa file you've used?

AdityaYadavalli1 commented 3 years ago

Yes. I just renamed lm.arpa to lm.txt while attaching it because GitHub wouldn't allow me to attach .arpa files. lm.txt

AdityaYadavalli1 commented 3 years ago

I have also tried to train the LM using this code. However, I got the following error.

created Count File
one of required modified KneserNey count-of-counts is zero
error in discount estimator for order 1
created Language Model
Traceback (most recent call last):
  File "spell-checker.py", line 1994, in <module>
    main()
  File "spell-checker.py", line 1978, in main
    LM, EM, correction_input = process_arguments(args)
  File "spell-checker.py", line 1752, in process_arguments
    LM, unigrams, bigrams = buildLanguageModel(files=file_container)
  File "spell-checker.py", line 582, in buildLanguageModel
    LM = LanguageModel(os.path.join(Path(DATA_DIR, TARGET_LANGUAGE_MODEL)))
  File "spell-checker.py", line 215, in __init__
    arpa = open(arpa_file, "r", encoding="iso-8859-1")
FileNotFoundError: [Errno 2] No such file or directory: 'data/LM2.arpa' 

This is the command I used for that: python spell-checker.py --order 3 --train data/corpus.txt -lm LM2.arpa
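For what it's worth, the FileNotFoundError looks like a symptom rather than the root cause: "created Language Model" is printed even though ngram-count aborted on the Kneser-Ney discount error, so data/LM2.arpa was never actually written. A rough sketch of how that could be made to fail loudly at the SRILM step, assuming ngram-count exits with a nonzero status in that case (the values below are placeholders for the module-level variables in spell-checker.py):

```python
import subprocess
from pathlib import Path

# Placeholders; in spell-checker.py these already exist as module-level values.
SRILM_PATH = "/usr/share/srilm/bin/i686-m64"   # assumption: wherever SRILM lives
DATA_DIR = Path("data")
N_GRAM = 3
TARGET_LANGUAGE_MODEL = "LM2.arpa"
smooth = "-kndiscount -interpolate"            # assumption: whatever 'smooth' holds

# check=True raises CalledProcessError if ngram-count exits with a nonzero
# status, so a failed LM build surfaces here instead of as a later
# FileNotFoundError when the .arpa file is opened.
subprocess.run(
    f"{SRILM_PATH}/ngram-count -vocab {DATA_DIR}/vocabulary.count "
    f"-order {N_GRAM} -unk -no-eos -no-sos "
    f"-read {DATA_DIR}/count{N_GRAM}.count "
    f"-lm {DATA_DIR}/{TARGET_LANGUAGE_MODEL} {smooth}",
    shell=True,
    check=True,
)
```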

Therefore I trained the LM externally.

Before this, I had run into another issue related to concatenating str and PosixPath objects. So I edited line 566 to

subprocess.call(SRILM_PATH + "/ngram-count -vocab "+str(DATA_DIR)+"/vocabulary.count   -order " + str(
   N_GRAM) + "   -no-eos -no-sos    -text "+ str(DATA_DIR)+"/corpus.txt  -unk  -write "+str(DATA_DIR)+"/count" + str(N_GRAM) + ".count", shell=True)

and 577 to

subprocess.call(SRILM_PATH + "/ngram-count   -vocab "+str(DATA_DIR)+"/vocabulary.count   -order " + str(
  N_GRAM) + "  -unk -no-eos  -no-sos -read "+str(DATA_DIR)+"/count" + str(N_GRAM) + ".count  -lm "+str(DATA_DIR)+"/" +  TARGET_LANGUAGE_MODEL +" " + smooth, shell=True)

These are minor modifications, and I don't expect them to affect anything else.
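In case it is useful, an alternative to the explicit str() casts would be to pass the arguments as a list instead of one shell string, since each piece is then converted individually. A rough sketch for the first call, again with placeholders for the module-level variables:

```python
import subprocess
from pathlib import Path

# Placeholders; these are module-level values in spell-checker.py.
SRILM_PATH = Path("/usr/share/srilm/bin/i686-m64")   # assumption
DATA_DIR = Path("data")
N_GRAM = 3

# Building the command as a list (no shell=True) sidesteps the str + PosixPath
# concatenation problem, because every element is converted to str explicitly.
cmd = [
    str(SRILM_PATH / "ngram-count"),
    "-vocab", str(DATA_DIR / "vocabulary.count"),
    "-order", str(N_GRAM),
    "-no-eos", "-no-sos",
    "-text", str(DATA_DIR / "corpus.txt"),
    "-unk",
    "-write", str(DATA_DIR / f"count{N_GRAM}.count"),
]
subprocess.call(cmd)
```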