Closed GeoDirk closed 2 years ago
Not sure what you mean by "correctly versification the text". I'm guessing you mean something like "parse the text by verse"? The plugin interfaces can already supply the plugin with tokens that it should be able to interpret correctly via the normal versification conversion process, so the plugin can work in whatever versification it needs even if the project uses a different or custom one.
@tombogle So if I'm reading in the USFM from one of the resources, it is already in its own correct versification (e.g., Russian Orthodox). Then I could use VerseRef.ChangeVersification() to get a reference back to ScrVers.Original. Do I have that right?
@GeoDirk, what information is needed from the usfm.sty file? You can get most of the styling information from IProject.ScriptureMarkerInformation. Some documentation is missing (I'm planning on adding it now) describing that the property can return IParagraphMarkerInfo, ICharacterMarkerInfo, and INoteMarkerInfo, which will give you the styling information for each marker type.
Again, I'm not sure what data you need specifically from custom.vrs, but you can get versification data for a project from IProject.Versification. Going from one versification to another can be done via IVersification.ChangeVersification or IVerseRef.ChangeVersification.
EDIT: Standard versifications are available from IPluginHost.GetStandardVersification.
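To illustrate what "going from one versification to another" means, here is a minimal, library-independent sketch in Python. It does not use the plugin API or Machine. `ToyVerseRef`, `CHAPTER_MAP_TO_ORIGINAL`, and `to_original` are hypothetical names, and the single mapping entry is only an illustration, not data read from a real .vrs file.

```python
from typing import Dict, NamedTuple, Tuple


class ToyVerseRef(NamedTuple):
    """Hypothetical stand-in for a verse reference (not the plugin API type)."""
    book: str
    chapter: int
    verse: int


# Illustrative chapter-level mapping from a custom versification to the
# "original" one. A real versification maps individual verses and ranges.
CHAPTER_MAP_TO_ORIGINAL: Dict[Tuple[str, int], Tuple[str, int]] = {
    ("PSA", 22): ("PSA", 23),
}


def to_original(ref: ToyVerseRef) -> ToyVerseRef:
    """Return the reference renumbered into the original versification.

    References with no mapping entry are returned unchanged.
    """
    book, chapter = CHAPTER_MAP_TO_ORIGINAL.get(
        (ref.book, ref.chapter), (ref.book, ref.chapter)
    )
    return ToyVerseRef(book, chapter, ref.verse)
```

Calling `to_original(ToyVerseRef("PSA", 22, 1))` renumbers the chapter to 23, while a reference with no mapping entry, such as `ToyVerseRef("GEN", 1, 1)`, comes back unchanged.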
@FoolRunning @tombogle I think I can get everything I need from the above then. As I mentioned above, we are using Machine's toolset for taking USFM and doing the tokenization step, and Damian's function uses those two files as inputs. But it seems like we could regenerate them from the above information or override his function.
Thanks!
I'm not sure what tokenization you're doing, but there is IProject.GetUSFMTokens which returns the project USFM as tokens already.
In our case, tokenization is where we break the verse text apart into all its subparts for SMT input.
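As a rough illustration of that kind of tokenization, here is a minimal Python sketch that splits verse text into word and punctuation tokens. It is a simple regex stand-in, not Machine's tokenizer, and `tokenize_verse` is a hypothetical name.

```python
import re
from typing import List

# A run of word characters is one token; any other single non-space
# character (punctuation, symbols) is a token of its own.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_verse(text: str) -> List[str]:
    """Split verse text into word and punctuation tokens."""
    return _TOKEN_PATTERN.findall(text)


print(tokenize_verse("In the beginning, God created."))
# ['In', 'the', 'beginning', ',', 'God', 'created', '.']
```

A real SMT pipeline would also need to handle things like apostrophes and language-specific word boundaries, which this sketch ignores.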
While GetAllProjects() works as advertised, for NLP we need a couple more things that are still hidden so we can correctly versification the text. Specifically, we need access to the "usfm.sty" and "custom.vrs" file data. Both of these are used by Damian's Machine in the versification/tokenization process.