ugermann / ssplit-cpp

Approximate reimplementation of the sentence splitter from the Moses toolkit.
Other
4 stars 9 forks source link

Test cases #7

Closed jerinphilip closed 3 years ago

jerinphilip commented 3 years ago

WIP: Adding a simple-test app for testing with GitHub CI.

jerinphilip commented 3 years ago

@ugermann I have added a few tests (files are small) and included in the source under tests directory for now. When I try to stream with one_paragraph_per_line, I am getting newlines which I should not be. You can find the failure in GitHub actions failure logs below this comment. Is this expected?

Could you document the behaviour of one_sentence_per_line, one_paragraph_per_line and wrapped_text somewhere? My understanding is that:

  1. one_sentence_per_line: Should be simple cin.getline (equivalent in speed, but string_view returned)
  2. one_paragraph_per_line: Regex based split of sentences within each line which is assumed to be a paragraph. Stream will provide me only valid sentences. (This is giving me an empty line and failing is what I think is happening in the test).
  3. wrapped_text: Text wrapped to n characters. Paragraphs are broken by an empty line.
ugermann commented 3 years ago

Behaviour of the ssplit executable is now documented in the README. The empty line between paragraphs is on purpose, even for one paragraph per line. How else would you preserve paragraph boundaries in text mode? The same by the way pertains to the SentenceStream streaming operator. An empty string_view indicates paragraph boundaries.

jerinphilip commented 3 years ago

@ugermann: The changes are adjusted to not mess with any existing CPP API. A few additional tests are added. Can you merge this in so it's possible to know if the behaviour breaks in the future through GitHub actions. bergamot-translator now depends on ssplit-cpp and it would be convenient to know if something changes.

As feedback, I don't particularly like the empty string_view marker being present. Knowing the original byterange in the reference text is enough for me. I had unnecessary trouble with empty string_view and had to add guards to adjust the annotation code for tests. But I can live with writing the if(sentence.size()) guards for now, if it finds a use-case elsewhere.

Edit: I also am not able to trigger GitHub actions on the repo at the moment, please look into this, if possible.