sillsdev / machine

Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages.
MIT License
26 stars 15 forks source link

Fix exception when parsing USFM with an empty verse paragraph #201

Closed ddaspit closed 5 months ago

ddaspit commented 5 months ago

This change is Reviewable

codecov-commenter commented 5 months ago

Codecov Report

Attention: Patch coverage is 81.81818% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 67.27%. Comparing base (61d792a) to head (7733533).

Files Patch % Lines
src/SIL.Machine/Corpora/ScriptureElement.cs 57.14% 2 Missing and 1 partial :warning:
...chine/Corpora/ScriptureRefUsfmParserHandlerBase.cs 88.46% 0 Missing and 3 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #201 +/- ## ========================================== + Coverage 67.24% 67.27% +0.03% ========================================== Files 441 441 Lines 34966 34993 +27 Branches 4689 4694 +5 ========================================== + Hits 23514 23543 +29 + Misses 10366 10361 -5 - Partials 1086 1089 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

johnml1135 commented 5 months ago

@ddaspit - added more errors found from the sample project. You may have a specific way you want to fix them.

johnml1135 commented 5 months ago

tests/SIL.Machine.Tests/Corpora/TestData/usfm/Tes/41MATTes.SFM line 5 at r2 (raw file):

\mt Matthew
\ip An introduction to Matthew\fe + \ft This is an endnote.\fe*
\p Here is another paragraph.

The current error is: System.InvalidOperationException : An error occurred while parsing the text 'MAT. Verse: MAT 1:0, offset: 179, error: 'Stack empty.'`. I still think that more context would be helpful - instead of people trying to count 179 tokens past the beginning of the verse (which starts exactly where?).

johnml1135 commented 5 months ago

tests/SIL.Machine.Tests/Corpora/TestData/usfm/Tes/41MATTes.SFM line 5 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
We could do line number and column instead of verse ref and verse offset. It would be a bit more work, since the UsfmTokenizer and UsfmParser aren't tracking line number and column, but it should be doable.

Line number and offset (along with verse ref) would be nice - let's do it.

johnml1135 commented 5 months ago

@ddaspit - Failed GetUsfm_NonVerse_Paragraph [84 ms]