davidbaines closed this issue 6 months ago
I had the exact same issue earlier today. Isaac has already added some nicer error handling. Were you aligning on only certain books, @davidbaines?
This was an alignment on the whole text.
Hi David, I was able to run the alignment locally, aligning to another random Bible. Is it possible that any of the Bibles you were aligning with are empty? Can you try pulling the most recent code to see if my fix for Eli's issue works for you, if you haven't already? If that doesn't work, can you point me to your config file?
Thanks!
Both files contain data. I think the problem was that the two files had no lines of text in common. The error is fixed now.
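For anyone hitting the same thing, the "no lines in common" case can be checked before running the alignment. This is only a minimal sketch, assuming the two corpus files are line-aligned (line N in each file is the same verse, and a missing verse is an empty line). The function name `count_shared_lines` is hypothetical and not part of silnlp.

```python
from typing import Iterable


def count_shared_lines(src_lines: Iterable[str], trg_lines: Iterable[str]) -> int:
    """Count line positions where both files have non-empty text.

    Assumes the two inputs are line-aligned; this helper is illustrative
    and is not taken from the silnlp code base.
    """
    return sum(
        1
        for src, trg in zip(src_lines, trg_lines)
        if src.strip() and trg.strip()
    )


# Two files that both contain data but never on the same line.
src = ["В начале", "", "И сказал"]
trg = ["", "And the earth", ""]
shared = count_shared_lines(src, trg)
if shared == 0:
    print("No lines in common: nothing to align.")
```

With real files, the two arguments could be open file objects, since iterating over a text file yields its lines. A count of zero means both files can be non-empty and still leave the training set empty.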
Calculating an alignment for ru-CASS.txt produced the following error message:
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\preprocess.py", line 30, in <module>
    main()
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\preprocess.py", line 25, in main
    config.preprocess(args.stats, args.force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 485, in preprocess
    self._build_corpora(tokenizer, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\open_nmt_config.py", line 650, in _build_corpora
    train_count = super()._build_corpora(tokenizer, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 525, in _build_corpora
    train_count += self._write_scripture_data_sets(tokenizer, pair, stats, force_align)
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\nmt\config.py", line 794, in _write_scripture_data_sets
    src_script = get_script("".join(cur_train["source"]))
  File "C:\Users\David\Documents\GitHub\silnlp\silnlp\common\script_utils.py", line 1864, in get_script
    return counts.most_common()[0][0]
IndexError: list index out of range
The file contains Cyrillic characters, so it's not just an empty file.
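The last frame of the traceback suggests what went wrong: if `counts` is a `collections.Counter` (an assumption based on the `most_common()` call) and the joined training text is empty, then `most_common()` returns an empty list and `[0][0]` raises `IndexError`. The sketch below reproduces that and shows one possible guard. The function `get_script_safe`, its `default` parameter, and the way it derives a script name from the Unicode character name are all hypothetical, not the actual silnlp implementation.

```python
import unicodedata
from collections import Counter


def get_script_safe(text: str, default: str = "Unknown") -> str:
    """Return the most frequent script in `text`, or `default` if none is found.

    Hypothetical stand-in for get_script: the script is approximated by the
    first word of each letter's Unicode name (e.g. "CYRILLIC", "LATIN").
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        # An empty Counter has no most common element, so return early
        # instead of indexing into an empty list.
        return default
    return counts.most_common(1)[0][0]


# The unguarded expression from the traceback fails on an empty Counter.
try:
    Counter().most_common()[0][0]
    raised = False
except IndexError:
    raised = True

print(raised)
print(get_script_safe(""))
print(get_script_safe("Привет"))
```

A guard like this only hides the symptom. An empty training set is still worth reporting to the user with a clear message, because the input file itself can contain plenty of text and still contribute no lines after alignment.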