sillsdev / machine

Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages.
MIT License
26 stars 15 forks source link

Add UsfmVerseTextUpdater class #160

Closed ddaspit closed 9 months ago

ddaspit commented 9 months ago

This change is Reviewable

codecov-commenter commented 9 months ago

Codecov Report

Attention: 71 lines in your changes are missing coverage. Please review.

Comparison is base (79c0832) 66.04% compared to head (267de3c) 66.59%.

Files Patch % Lines
...achine/Corpora/ZipParatextProjectSettingsParser.cs 51.11% 17 Missing and 5 partials :warning:
...chine/Corpora/ParatextProjectSettingsParserBase.cs 80.76% 7 Missing and 8 partials :warning:
src/SIL.Machine/Corpora/UsfmParser.cs 31.81% 14 Missing and 1 partial :warning:
src/SIL.Machine/Corpora/UsfmVerseTextUpdater.cs 92.85% 8 Missing and 2 partials :warning:
...chine/Corpora/FileParatextProjectSettingsParser.cs 80.00% 3 Missing and 1 partial :warning:
...c/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs 83.33% 2 Missing and 1 partial :warning:
src/SIL.Machine/Corpora/ParatextProjectSettings.cs 94.59% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #160 +/- ## ========================================== + Coverage 66.04% 66.59% +0.55% ========================================== Files 428 433 +5 Lines 33529 33738 +209 Branches 4511 4525 +14 ========================================== + Hits 22144 22469 +325 + Misses 10301 10200 -101 + Partials 1084 1069 -15 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/FileParatextProjectSettingsParser.cs line 5 at r1 (raw file):

using System.IO;
using System.Linq;
using System.Text;

Wouldn't you put the usings into their own file?

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/FileParatextProjectSettingsParser.cs line 5 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…
Wouldn't you put the usings into their own file?

Ah, older version of .net.

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 67 at r1 (raw file):


            string prefix = "";
            string form = "41MAT";

So, we default to Matthew? Shouldn't we default to crashing? I don't think I fully understand this logic and intention.

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

            string form = "41MAT";
            string suffix = ".SFM";
            XElement namingElem = settingsDoc.Root.Element("Naming");

Part of the general question of if the data is not meeting expectations. I don't know the history of this code (I am assuming it is copied from somewhere else with minor edits). If that is the case, are there any substantive logic changes?

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…
Part of the general question of if the data is not meeting expectations. I don't know the history of this code (I am assuming it is copied from somewhere else with minor edits). If that is the case, are there any substantive logic changes?

How to handle missing data, exceptions, etc.

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 86 at r1 (raw file):


            string biblicalTerms = settingsDoc.Root.Element("BiblicalTermsListSetting").Value;
            string[] parts = biblicalTerms.Split(new[] { ':' }, 3);

There are so many ways to fail, including not getting 3 sections. Are we ok with just throwing an exception and failing the process?

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextTextCorpus.cs line 9 at r1 (raw file):

        public ParatextTextCorpus(string projectDir, bool includeMarkers = false)
        {
            var parser = new FileParatextProjectSettingsParser(projectDir);

For my information, what is the difference between ParatextTextCorpus and ParatextBackupTextCorpus?

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/UsfmParser.cs line 47 at r1 (raw file):

        }

        private static readonly Regex OptBreakSplitter = new Regex("(//)", RegexOptions.Compiled);

I'm assuming this is just a naming update. Descriptive names are better than comments?

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 31 at r1 (raw file):

                    return;

                var settingsParser = new ZipParatextProjectSettingsParser(archive);

Ah, the backup is using the zipped one...

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextTextCorpus.cs line 9 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…
For my information, what is the difference between ParatextTextCorpus and ParatextBackupTextCorpus?

Ah - one grabs from a zip file, the other from files on the computer.

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/UsfmVerseTextUpdater.cs line 117 at r1 (raw file):

            string pubNumber
        )
        {

A really simple explanation of what this is doing would be great. This class is not used in machine - likely only in Serval (which hasn't been pushed yet). It appears to be used to update an existing USFM file with pretranslations - but the logic appears split between a few locations.

johnml1135 commented 9 months ago

There is a bit of a lack of coverage for the TextUpdater, specifically the Sidebar, Milestone, Ref, OptBreak , Unmatched. Also, there is some missing coverage in the other files (as seen above in the report).

johnml1135 commented 9 months ago

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
Yes, this code was essentially extracted from another class and made to be reusable.

In aerospace, if code has been working for a long time, it's grandfathered in. That's fine - we can address issues if they come up.

johnml1135 commented 9 months ago

I would like more testing for these cases...

johnml1135 commented 9 months ago

I suspected that there could be an issue there :-)