sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Reuse the same code for extracting verses where possible. #158

Open davidbaines opened 1 year ago

davidbaines commented 1 year ago

At least three scripts have to extract verse data from Paratext projects or USFM files: translate.py, bulk_extract_corpora.py, and extract_corpora.py.

translate.py does not properly remove Strong's numbers from projects that include them, whereas bulk_extract_corpora.py does. It would be good to check that every part of the pipeline that extracts verse text from USFM uses the same code.
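
To illustrate the idea, here is a minimal sketch of a single shared helper that every script could call. It assumes Strong's numbers appear as USFM word-level attributes, e.g. `\w beginning|strong="H7225"\w*`. The names `strip_word_attributes` and `extract_verse_text` are hypothetical and are not taken from silnlp or machine.py; a real fix would more likely route all scripts through one USFM parser than through a regex.

```python
import re

# Matches a USFM word-level marker such as: \w beginning|strong="H7225"\w*
# (and the nested form \+w ... \+w*). Group 1 is the visible word text;
# anything after the first "|" is attribute data such as Strong's numbers.
_WORD_MARKER = re.compile(r"\\\+?w\s+([^|\\]*)(?:\|[^\\]*)?\\\+?w\*")


def strip_word_attributes(usfm_text: str) -> str:
    # Replace each word-level span with its visible text only.
    return _WORD_MARKER.sub(lambda m: m.group(1).strip(), usfm_text)


def extract_verse_text(usfm_verse: str) -> str:
    # Hypothetical shared entry point: clean one verse's raw USFM content
    # and normalize whitespace.
    text = strip_word_attributes(usfm_verse)
    return re.sub(r"\s+", " ", text).strip()
```

If translate.py and both extraction scripts called one function like this, a project with Strong's numbers would be cleaned the same way regardless of which script processed it.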

davidbaines commented 1 year ago

This fix might also resolve issue #157.

isaac091 commented 2 weeks ago

Hi @davidbaines, can you verify whether Strong's numbers are now removed with the updates to translate.py? Since bulk_extract_corpora.py was already using the machine.py parser, as you said, I would assume that there is no longer an issue.