ubsicap / paratext_demo_plugins

Sample code to demonstrate how to create a Paratext plugin

GetAllProjects() Additional Information #9

Closed GeoDirk closed 2 years ago

GeoDirk commented 2 years ago

While GetAllProjects() works as advertised, for NLP we need a couple more things that are still hidden so we can correctly versification the text. Specifically, we need access to the data in the "usfm.sty" and "custom.vrs" files. Both are used by Damian's Machine in the versification/tokenization process.

tombogle commented 2 years ago

Not sure what you mean by "correctly versification the text". I'm guessing you mean something like "parse the text by verse"? The plugin interfaces can already supply the plugin with tokens, and the plugin should be able to interpret them correctly via the normal versification conversion process. That way the plugin can work in whatever versification it needs, even if the project uses a different or custom one.

GeoDirk commented 2 years ago

@tombogle So if I'm reading in the USFM from one of the resources, it is already in its own correct versification (e.g., Russian Orthodox). Then I could use VerseRef.ChangeVersification() to get a reference back in ScrVers.Original. Do I have that right?

FoolRunning commented 2 years ago

@GeoDirk, what information is needed from the usfm.sty file? You can get most of the styling information from IProject.ScriptureMarkerInformation. Some documentation is missing (I'm planning on adding it now) describing that the property can return IParagraphMarkerInfo, ICharacterMarkerInfo, and INoteMarkerInfo, which give you the styling information for each marker type.

Again, I'm not sure what data you need specifically from custom.vrs, but you can get versification data for a project from IProject.Versification. Going from one versification to another can be done via IVersification.ChangeVersification or IVerseRef.ChangeVersification.

EDIT: Standard versifications can be obtained from IPluginHost.GetStandardVersification.
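To make the idea of changing versification concrete, here is a minimal Python sketch. It is not the Paratext API: the `VerseRef` tuple and the `septuagint_to_hebrew_psalm` helper are hypothetical names invented for illustration. It covers only one well-known difference, namely that Psalms 10-112 in Septuagint-based numbering (as used in Russian Orthodox texts) correspond to Psalms 11-113 in Hebrew-based numbering, and it ignores verse-level shifts entirely.

```python
from typing import NamedTuple


class VerseRef(NamedTuple):
    """Hypothetical stand-in for a verse reference (not Paratext's IVerseRef)."""
    book: str
    chapter: int
    verse: int


def septuagint_to_hebrew_psalm(ref: VerseRef) -> VerseRef:
    """Map a reference from Septuagint-style to Hebrew-style Psalm numbering.

    Simplified on purpose: only chapter numbers are adjusted, and Psalms whose
    mapping depends on the verse (splits and merges) are rejected.
    """
    if ref.book != "PSA":
        return ref  # Other books are left untouched in this toy example.
    if ref.chapter <= 8 or 148 <= ref.chapter <= 150:
        return ref  # These Psalms carry the same number in both schemes.
    if 10 <= ref.chapter <= 112:
        return ref._replace(chapter=ref.chapter + 1)
    raise NotImplementedError(
        f"Psalm {ref.chapter} is split or merged and needs a verse-level mapping"
    )


print(septuagint_to_hebrew_psalm(VerseRef("PSA", 22, 1)))
```

A real versification file describes such mappings verse by verse, which is why the conversion is better left to the host application where possible.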

GeoDirk commented 2 years ago

@FoolRunning @tombogle I think I can get everything I need from the above then. As I mentioned, we are using Machine's toolset to take USFM and do the tokenization step, and Damian's function uses those two files as inputs. But it seems like we could regenerate them from the above information or override his function.

Thanks!

FoolRunning commented 2 years ago

I'm not sure what tokenization you're doing, but there is IProject.GetUSFMTokens which returns the project USFM as tokens already.
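For readers unfamiliar with the term, the following Python sketch shows roughly what splitting USFM into tokens means. It is a toy illustration, not what IProject.GetUSFMTokens does: `split_usfm` and its regular expression are hypothetical, they only separate backslash markers from the text between them, and they leave verse numbers inside the text token.

```python
import re

# A marker is a backslash followed by lowercase letters/digits, with an
# optional leading "+" (nested marker) and trailing "*" (closing marker).
# Anything up to the next backslash is treated as text.
_USFM_PATTERN = re.compile(r"\\(\+?[a-z0-9]+\*?)|([^\\]+)")


def split_usfm(usfm: str) -> list:
    """Split a USFM string into ("marker", name) and ("text", content) pairs."""
    tokens = []
    for match in _USFM_PATTERN.finditer(usfm):
        marker, text = match.groups()
        if marker is not None:
            tokens.append(("marker", marker))
        elif text.strip():
            # Skip whitespace-only runs between adjacent markers.
            tokens.append(("text", text.strip()))
    return tokens


print(split_usfm(r"\p \v 1 Jesus wept."))
```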

GeoDirk commented 2 years ago

In our case, tokenization is where we break the verse text apart into all its subparts for SMT input.
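As a rough illustration of this kind of tokenization, here is a minimal Python sketch. It is an assumption-laden stand-in, not Machine's tokenizer: the `tokenize_verse` helper and its regular expression are hypothetical, and they simply split verse text into runs of word characters and individual punctuation marks.

```python
import re

# A token is either a run of word characters or a single character that is
# neither a word character nor whitespace (i.e. punctuation).
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_verse(text: str) -> list:
    """Split verse text into word and punctuation tokens."""
    return _TOKEN_PATTERN.findall(text)


print(tokenize_verse("Jesus wept."))  # ['Jesus', 'wept', '.']
```

Note that this naive rule also splits on apostrophes and hyphens inside words, which a production tokenizer would normally handle with language-specific rules.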
