Triggering Run Basic Checks

GeoDirk commented 2 years ago

For our purposes, when parsing through the USFM tokens, we are coming across projects that have a bunch of weird things with their verse tags:

\v : empty verse tags
\v 4Tonga : places where the verse tag and the verse text run together
\v 4- : places where they accidentally forgot to finish off a verse range

I'm sure we will find more of these types of USFM errors as we go along. We can usually detect them in our plugin and report them back to the user. Ideally, however, we could trigger the "Run Basic Checks" function, get back a report on what is bad, and make the user clean up the mess. The basic checks would have caught all of the issues we've encountered thus far. This feature would be a new enhancement to the API.

Alternatively, you all probably have a standard library out there that could look at the USFM and do the checks. Anything like that out there in your public libraries?
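
For illustration, here is a minimal sketch of the kind of plugin-side detection described above. It is written in Python against raw USFM text rather than the plugin API's token objects, and it is not the Paratext "Run Basic Checks" logic. The function name `check_verse_tags`, the regular expressions, and the simplified notion of a valid verse number (a number, an optional lowercase segment letter, and an optional range) are all assumptions made for this example.

```python
import re

# A \v marker (not \va, \vp, etc.) plus the first non-space run after it.
VERSE_MARKER = re.compile(r"\\v(?![a-z])[ \t]*(\S*)")
# Simplified assumption of a well-formed verse number: 4, 4a, 4-5, 4a-5b.
VALID_NUMBER = re.compile(r"^\d+[a-z]?(-\d+[a-z]?)?$")
# A range whose end was never written, such as "4-".
UNFINISHED_RANGE = re.compile(r"^\d+[a-z]?-$")


def check_verse_tags(usfm: str) -> list[tuple[int, str, str]]:
    """Return (line number, offending tag, problem) for each suspect \\v tag."""
    problems = []
    for line_no, line in enumerate(usfm.splitlines(), start=1):
        for match in VERSE_MARKER.finditer(line):
            number = match.group(1)
            tag = match.group(0).strip()
            if not number or not number[0].isdigit():
                problems.append((line_no, tag, "empty verse tag"))
            elif UNFINISHED_RANGE.match(number):
                problems.append((line_no, tag, "unfinished verse range"))
            elif not VALID_NUMBER.match(number):
                problems.append((line_no, tag, "verse number runs into text"))
    return problems


sample = "\n".join([
    r"\c 1",
    r"\v 1 A well-formed verse",
    r"\v",
    r"\v 4Tonga",
    r"\v 4- text",
    r"\v 5-6 a finished range",
])
for line_no, tag, problem in check_verse_tags(sample):
    print(f"line {line_no}: {tag!r}: {problem}")
```

A check like this only covers the three cases listed above; anything the real basic checks validate beyond that would still need to come from Paratext itself.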

FoolRunning commented 2 years ago

Unfortunately, all the checking code is currently in the Paratext executable (which is not public).

tombogle commented 2 years ago

I wonder if maybe something could be added to the API to indicate that you want the USFM tokens, but only if/when the checks have been run and passed cleanly.

FoolRunning commented 2 years ago

@GeoDirk, I'm not sure how clean you need your data or whether this would work for you, but you could try getting the USX first using strict=true to make sure that the data is clean before reading in the tokens.

GeoDirk commented 2 years ago

Would using USX and strict = true skip verse data? If so, then I would rather not go that route and stick with the alerts that I have now parsing the USFM.

Basically I need to produce the equivalent of:

01001001 In the beginning, God created the heavens and the earth.
01001002 ...

This is why bad verse tags are problematic. Obtaining the verse text without the extra attributes has been surprisingly easy by parsing through your USFM tokens.

We send this data off to NLP for processing and alignment, which is why we need the precision.
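
As a rough sketch of the target format shown above: the example line `01001001` looks like a two-digit book number followed by a three-digit chapter and a three-digit verse, but that layout is inferred from the single example and is an assumption. The helper names `verse_id` and `format_line` are hypothetical and are not part of the plugin API.

```python
def verse_id(book: int, chapter: int, verse: int) -> str:
    """Build an 8-digit reference, assuming a BBCCCVVV layout."""
    return f"{book:02d}{chapter:03d}{verse:03d}"


def format_line(book: int, chapter: int, verse: int, text: str) -> str:
    """Prefix the verse text with its reference, collapsing extra whitespace."""
    clean_text = " ".join(text.split())
    return f"{verse_id(book, chapter, verse)} {clean_text}"


print(format_line(1, 1, 1, "In the beginning, God created the heavens and the earth."))
```

With a fixed-width reference like this, a verse tag such as `\v 4-` or `\v 4Tonga` has no clean integer to format, which is where the precision problem shows up.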

FoolRunning commented 2 years ago

Yeah, strict=true probably won't work, since it validates a bunch of other stuff you probably don't want validated (it's designed to get the data to a pristine state, for example for upload to DBL).

ubsicap / paratext_demo_plugins

Triggering Run Basic Checks #14