rminsil commented 1 month ago

Background

See parent issue for general context: https://github.com/sillsdev/silnlp/issues/472

Overview

Issue https://github.com/sillsdev/silnlp/issues/473 added the script extract_xri.py to the silnlp repo.

Michael has begun testing the script on Tanzanian files and has a Tanzanian workshop coming up.

The scope of this ticket is to capture his feedback and execute on it.

rminsil commented 1 month ago

Line separator issue

Michael reported:

when I run the extract_xri script on the latest datasets that we got from XRI, the extract files have extra newlines between each sentence. For instance, this is what the Swahili extract file looks like for one of the datasets:

.

Aah! Uweze kufufuka, kila mtu aje, achague njia, na afufuke kwa mwanzo mpya.

Aalimuita "njoo" iliamthibitishie kiasi kwamba mbingu itakuja kwa wale wanaoitafuta kweli.

Acha kila mmoja wa waumini achukue muda huu si tu kuamini, bali pia awagawie wengine.

... and so on ...

The target language extract file is similarly formatted.

I suspect this has to do with the use of os.linesep in the write_output_file method: f.write(f"{sentence}{os.linesep}"). Changing this to just a newline resolves the issue: f.write(sentence + "\n"). Perhaps a difference between running on Windows and Linux?

I'll update this comment when I get more clarification on Michael's setup and what the downstream tools expect the line endings to be.

UPDATE: Michael is using Windows 11. He generates the extract files from a PyCharm terminal and reads them in Notepad++. I guess he has Notepad++ set to expect Unix line endings.

Michael's recommendation:

I'd suggest applying the already suggested newline fix, as long as that also works correctly on a Linux system.

Addressed here: https://github.com/sillsdev/silnlp/pull/548
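
A likely explanation for the extra blank lines, offered as a sketch rather than a confirmed diagnosis: a file opened in Python's default text mode already translates each "\n" to the platform line separator on write. On Windows, os.linesep is "\r\n", so writing it in text mode produces "\r\r\n", which many editors display as an extra blank line. The snippet below reproduces this on any platform by passing newline="\r\n" to open() to simulate Windows text mode. The helper names write_sentences and raw_bytes are hypothetical and are not taken from extract_xri.py.

```python
import tempfile
from pathlib import Path

WINDOWS_LINESEP = "\r\n"  # the value of os.linesep on Windows


def write_sentences(path, sentences, terminator, newline=None):
    # Hypothetical stand-in for write_output_file; newline is passed to open().
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        for sentence in sentences:
            f.write(f"{sentence}{terminator}")


def raw_bytes(sentences, terminator, newline):
    # Write to a temporary file and return the exact bytes that landed on disk.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "extract.txt"
        write_sentences(path, sentences, terminator, newline)
        return path.read_bytes()


# newline="\r\n" makes Python translate "\n" to "\r\n", as Windows text mode does.
buggy = raw_bytes(["a", "b"], WINDOWS_LINESEP, newline="\r\n")
fixed = raw_bytes(["a", "b"], "\n", newline="\r\n")
# newline="\n" disables translation, giving Unix endings on every platform.
unix = raw_bytes(["a", "b"], "\n", newline="\n")

print(buggy)  # b'a\r\r\nb\r\r\n'
print(fixed)  # b'a\r\nb\r\n'
print(unix)   # b'a\nb\n'
```

Writing a plain "\n" works on both Windows and Linux because text mode handles the platform translation. If downstream tools need byte-identical output across platforms, passing newline="\n" to open() is one option to consider.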

rminsil commented 1 month ago

Filtering of single character sentences

Michael reported:

looking at the first few entries in that Swahili file (one is blank, the other is just a period), it looks like we could use some additional filtering logic. We had identified one filtering rule that looked for "!" as the target translation. Perhaps we could expand that to filter any single character source or target sentence? And to filter out any empty source or target sentence?

This is clear.

The current requirements are to:

trim source and target
filter if the target sentence is '!'

The new requirements would be:

trim source and target
filter if either source or target has 0 or 1 characters after trimming

This would filter out cases where a sentence is:

empty
just some whitespace
a single non-whitespace character
a single non-whitespace character with some boundary whitespace around it

I'll use a different log message for pairs where the target is '!', to make it easier to grep for the number of aborted translations.
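
The new requirements could be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in the PR: the function name clean_pair, its return convention, and the log message wording are all hypothetical.

```python
import logging
from typing import Optional, Tuple

LOGGER = logging.getLogger(__name__)


def clean_pair(source: str, target: str) -> Optional[Tuple[str, str]]:
    """Return the trimmed (source, target) pair, or None if it is filtered out."""
    source = source.strip()
    target = target.strip()

    # Checked first so aborted translations get their own greppable message.
    if target == "!":
        LOGGER.info("Filtered: aborted translation (target is '!')")
        return None

    # Covers empty, whitespace-only, and single-character sentences,
    # because both strings were trimmed above.
    if len(source) <= 1 or len(target) <= 1:
        LOGGER.info("Filtered: source or target has 0 or 1 characters")
        return None

    return source, target


print(clean_pair(" Hello there ", " Habari "))  # ('Hello there', 'Habari')
print(clean_pair(" . ", "Habari"))              # None
print(clean_pair("Hello there", "!"))           # None
```

Testing for '!' before the length check keeps the two log messages distinct, even though the length rule alone would also remove those pairs.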

Addressed here: https://github.com/sillsdev/silnlp/pull/547

Iterate on XRI extract script after initial usage #546
