Closed by rminsil 1 month ago
Michael reported:
when I run the extract_xri script on the latest datasets that we got from XRI, the extract files have extra newlines between each sentence. For instance, this is what the Swahili extract file looks like for one of the datasets:
.
Aah! Uweze kufufuka, kila mtu aje, achague njia, na afufuke kwa mwanzo mpya.
Aalimuita "njoo" iliamthibitishie kiasi kwamba mbingu itakuja kwa wale wanaoitafuta kweli.
Acha kila mmoja wa waumini achukue muda huu si tu kuamini, bali pia awagawie wengine.
... and so on ...
The target language extract file is similarly formatted.
I suspect this has to do with the use of os.linesep in the write_output_file method:
f.write(f"{sentence}{os.linesep}")
Changing this to just a newline resolves the issue:
f.write(sentence + "\n")
Perhaps a difference between running on Windows and Linux?
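A likely explanation for the platform difference: a file opened in text mode on Windows already translates each "\n" written into "\r\n", so writing os.linesep (itself "\r\n" on Windows) ends up as "\r\r\n" on disk, which some editors display as an extra blank line. The sketch below reproduces this on any platform by passing newline="\r\n" to open() to emulate the Windows default. The helper names (write_sentences, raw_bytes) are made up for illustration and are not taken from extract_xri.py.

```python
import os
import tempfile

# What os.linesep evaluates to on Windows; hard-coded so the sketch
# behaves the same on Linux and macOS.
WINDOWS_LINESEP = "\r\n"


def write_sentences(path, sentences, terminator, newline=None):
    """Write one sentence per line in text mode (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        for sentence in sentences:
            f.write(f"{sentence}{terminator}")


def raw_bytes(path):
    """Read the file back in binary mode to see the exact line endings."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "extract.txt")

    # newline="\r\n" emulates Windows text mode: every "\n" written
    # becomes "\r\n", so the "\r\n" terminator turns into "\r\r\n".
    write_sentences(path, ["a", "b"], WINDOWS_LINESEP, newline="\r\n")
    doubled = raw_bytes(path)

    # Writing a plain "\n" under the same translation gives clean "\r\n".
    write_sentences(path, ["a", "b"], "\n", newline="\r\n")
    fixed = raw_bytes(path)

    # Forcing newline="\n" would give Unix endings on every platform.
    unix = raw_bytes(path) if False else None
    write_sentences(path, ["a", "b"], "\n", newline="\n")
    unix = raw_bytes(path)

print(doubled)
print(fixed)
print(unix)
```

The Python documentation for os.linesep gives the same advice: use a single "\n" when writing files opened in text mode, on all platforms. If the downstream tools need identical bytes regardless of where the script runs, passing newline="\n" to open() is one option, though whether that is wanted here depends on what those tools expect.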
I'll update this comment when I get more clarification on Michael's setup and what the downstream tools expect the line endings to be.
UPDATE: Michael is using Windows 11. He generates the extract files from a PyCharm terminal and reads them in Notepad++. I guess he has Notepad++ set to expect Unix line endings.
Michael's recommendation:
I'd suggest applying the already suggested newline fix, as long as that also works correctly on a Linux system.
Addressed here: https://github.com/sillsdev/silnlp/pull/548
Michael reported:
Looking at the first few entries in that Swahili file (one is blank, the other is just a period), it looks like we could use some additional filtering logic. We had identified one filtering rule that looked for "!" as the target translation. Perhaps we could expand that to filter any single-character source or target sentence? And to filter out any empty source or target sentence?
This is clear.
The current requirements are to:
The new requirements would be:
This would filter out cases where a sentence is:
I'll use a different log message for cases when they use '!' to make it easier to grep the number of aborted translations.
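A minimal sketch of the filtering Michael describes, assuming sentence pairs arrive as (source, target) tuples. The function names, logger name and log wording are hypothetical and not taken from extract_xri.py. Treating whitespace-only sentences as empty is also an assumption.

```python
import logging
from typing import Iterable, List, Tuple

# Hypothetical logger name, for illustration only.
LOGGER = logging.getLogger("extract_xri_sketch")


def is_too_short(sentence: str) -> bool:
    """True for empty, whitespace-only, or single-character sentences."""
    return len(sentence.strip()) <= 1


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop pairs where either side is empty or a single character."""
    kept = []
    for index, (source, target) in enumerate(pairs):
        if target.strip() == "!":
            # Separate message so aborted translations are easy to grep.
            LOGGER.warning("Pair %d skipped: translation aborted ('!')", index)
        elif is_too_short(source) or is_too_short(target):
            LOGGER.warning(
                "Pair %d skipped: empty or single-character sentence", index
            )
        else:
            kept.append((source, target))
    return kept


example = [
    ("Hello there.", "Habari."),
    ("", "Sawa sawa."),        # empty source
    (".", "Sawa sawa."),       # single-character source
    ("Good morning.", "!"),    # aborted translation
    ("Fine.", "  "),           # whitespace-only target
]
print(filter_pairs(example))
```

Checking for "!" before the general length rule keeps the aborted-translation count separate, even though "!" would also be caught as a single character.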
Addressed here: https://github.com/sillsdev/silnlp/pull/547
Background
See parent issue for general context: https://github.com/sillsdev/silnlp/issues/472
Overview
https://github.com/sillsdev/silnlp/issues/473 created a script, extract_xri.py, in the silnlp repo. Michael has begun testing the script on Tanzanian files and has a Tanzanian workshop coming up.
The scope of this ticket is to capture his feedback and execute on it.