sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Extract not working ideally with versification 6. #198

Closed davidbaines closed 11 months ago

davidbaines commented 1 year ago

I tried to extract the resource GEO02, and although it contains a full Bible, most of it was missing from the extract after DAN 3:23. I modified the SFM file for Daniel in several ways to test whether any of them would fix the problem:

- Removed extra line breaks in the middle of verses.
- Removed \p markers.
- Removed \r markers to the end of the line.
- Removed alternative verse numbers of the form: \va 31\va*
- Made a simpler Settings.xml file.

I tested extracting after each change, and none of these changes solved the problem.

I changed the versification setting in the Settings.xml file from 6 to 4 and that solved the problem.
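For anyone checking the same thing, here is a minimal sketch for reading the versification value out of a project's Settings.xml before extracting. It assumes the value lives in a top-level `<Versification>` element; that element name and the helper `read_versification` are assumptions for illustration and are not taken from silnlp.

```python
import xml.etree.ElementTree as ET


def read_versification(settings_xml: str):
    """Return the text of the <Versification> element, or None if absent.

    Assumption: the setting is a direct child of the root element of
    Settings.xml and holds a small integer code such as 4 or 6.
    """
    root = ET.fromstring(settings_xml)
    node = root.find("Versification")  # direct children of the root only
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Usage with an inline, made-up Settings.xml fragment.
example = "<ScriptureText><Versification>6</Versification></ScriptureText>"
print(read_versification(example))
```

For a real project, the file contents can be read with `open(path, encoding="utf-8").read()` and passed in the same way.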

mshannon-sil commented 11 months ago

It was determined that the project is using versification setting 4 for the Old Testament and versification setting 1 for the New Testament. If a project contains an Old Testament and a New Testament with different versification settings, they should be split into separate Paratext projects.

It would be helpful for the extract corpora script to provide a warning to the user if the verses in the project don't match with the versification setting in the Settings.xml file so that issues like this one can be more easily identified. I'll open up a separate issue for this.
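As a rough illustration of what such a check could look like (this is not the silnlp implementation), the sketch below scans SFM text for `\c` and `\v` markers and reports verses whose number exceeds the expected maximum for that chapter. The function name `find_out_of_range_verses` and the `max_verses` table are hypothetical; a real check would load the per-chapter verse counts from the versification named in Settings.xml.

```python
import re
import warnings


def find_out_of_range_verses(sfm_text, max_verses):
    """Return (chapter, verse) pairs where the verse exceeds max_verses[chapter].

    max_verses is a hypothetical {chapter: last_verse} table for one book.
    Chapters missing from the table are skipped rather than flagged.
    """
    problems = []
    chapter = 0
    # Match only "\c N" and "\v N"; markers such as \va or \cl are ignored
    # because the marker letter must be followed directly by whitespace.
    for match in re.finditer(r"\\(c|v)\s+(\d+)", sfm_text):
        marker, number = match.group(1), int(match.group(2))
        if marker == "c":
            chapter = number
        elif chapter in max_verses and number > max_verses[chapter]:
            problems.append((chapter, number))
    return problems


# Usage: a made-up Daniel fragment checked against an illustrative table
# in which chapter 3 ends at verse 30.
sfm = "\\c 3\n\\v 23 text\n\\v 24 \\va 31\\va* text\n\\v 31 text\n"
for chap, verse in find_out_of_range_verses(sfm, {3: 30}):
    warnings.warn(f"Verse {chap}:{verse} is outside the configured versification")
```

This only catches verse numbers that are too high for the configured versification; missing verses or mismatched chapter counts would need additional checks.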