sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Error with preprocessing with the --stats option for certain files. #362

Closed davidbaines closed 6 months ago

davidbaines commented 6 months ago

When calculating an alignment for ru-CASS.txt, I got the following error message.

Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\preprocess.py", line 30, in <module>
    main()
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\preprocess.py", line 25, in main
    config.preprocess(args.stats, args.force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 485, in preprocess
    self._build_corpora(tokenizer, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\open_nmt_config.py", line 650, in _build_corpora
    train_count = super()._build_corpora(tokenizer, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 525, in _build_corpora
    train_count += self._write_scripture_data_sets(tokenizer, pair, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 794, in _write_scripture_data_sets
    src_script = get_script("".join(cur_train["source"]))
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\common\script_utils.py", line 1864, in get_script
    return counts.most_common()[0][0]
IndexError: list index out of range

The file contains Cyrillic characters, so it's not just an empty file.
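
For reference, the last frame of the traceback fails because `most_common()` returns an empty list when nothing has been counted, so indexing `[0]` raises `IndexError`. The sketch below shows one way such a lookup could be guarded. It is not the silnlp implementation: `guess_script` is a hypothetical helper, and it uses the first word of each character's Unicode name as a rough stand-in for the script.

```python
from collections import Counter
from typing import Optional
import unicodedata


def guess_script(text: str) -> Optional[str]:
    """Return the most frequent script-like label in text, or None if no letters are found.

    Hypothetical helper for illustration only; not silnlp's get_script.
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            # Crude proxy for the script: the first word of the Unicode
            # character name, e.g. "CYRILLIC" or "LATIN".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        # An empty Counter makes most_common() return [], so report
        # "unknown" here instead of indexing into an empty list.
        return None
    return counts.most_common(1)[0][0]
```

With this guard, an empty input yields `None` rather than an exception, and the caller can decide whether to skip the corpus pair or report a clearer error.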

Enkidu93 commented 6 months ago

I had the exact same issue earlier today. Isaac has already added some nicer error catching. Were you aligning on only certain books, @davidbaines?

davidbaines commented 6 months ago

This was an alignment on the whole text.

isaac091 commented 6 months ago

Hi David, I was able to run the alignment locally, aligning to another random Bible. Is it possible that any of the Bibles you were aligning with are empty? Can you try pulling the most recent code to see if my fix for Eli's issue works for you, if you haven't already? If that doesn't work, can you point me to your config file?

Thanks!

davidbaines commented 6 months ago

Both files contain data. I think the problem was that there were no lines of text in common. The error is fixed now.
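
A quick way to check for this situation before preprocessing might look like the sketch below. It assumes the two text files are line-aligned (one segment per line, with a blank line where a segment is missing), which is an assumption about the file layout rather than something stated in this thread. `count_shared_lines` is a hypothetical helper, not part of silnlp.

```python
from typing import Iterable


def count_shared_lines(src_lines: Iterable[str], trg_lines: Iterable[str]) -> int:
    """Count line positions where both inputs contain non-blank text.

    Assumes the inputs are line-aligned; lines past the end of the
    shorter input are ignored.
    """
    return sum(1 for s, t in zip(src_lines, trg_lines) if s.strip() and t.strip())


# Two files that both contain data but never on the same line:
src = ["В начале", "", ""]
trg = ["", "second line", "third line"]
shared = count_shared_lines(src, trg)
```

If the count is zero, the joined training text would be empty even though neither file is, which matches the symptom described above.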