sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

Open mmartin9684-sil opened 3 weeks ago

mmartin9684-sil commented 3 weeks ago

A project (Galego_2024_10_24.save) has an invalid book ID in the "\id" tag ("\id JUD") for the book of Judges; it should be "JDG" instead of "JUD".

No error was reported for this mismatch, but the book content was not extracted into the extract file. Also, most other books in the project were not extracted into the extract file either. The extract file contained verse extracts only for GEN, JUD, and REV, even though the project has a complete NT and multiple OT books.

A warning about the book ID mistake would be helpful, and the error should not affect the extraction of the other books.

mshannon-sil commented 6 days ago

I took a deep dive into what's going on, and this is what I found. In the Paratext project, as you mentioned, the file for Judges uses "JUD" in its "\id" tag, which is the ID for the book of Jude, not Judges. So when the project_corpus is initialized inside the extract_corpora function, the project_corpus rows for Judges are incorrectly labeled JUD, although at this point no verses are missing yet.

The verses go missing when the project_corpus is aligned with the reference_corpus (the corpus that contains every possible verse ref in the chosen versification) to create the parallel_corpus. The get_rows method for the parallel_corpus first goes through Genesis as normal. When it reaches the second book, Judges, it sees the JUD ID in the project_corpus rows and iterates through the reference_corpus rows until it gets to JUD. Since the reference_corpus is now at JUD, once all the JUD rows are finished, REV is the next verse ref in the reference_corpus, and get_rows iterates through the project_corpus until it reaches REV as well, skipping every verse that comes before it.
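To make the skipping behaviour concrete, here is a toy reproduction. It is a sketch under assumptions, not the actual get_rows implementation: each book is reduced to a single row, the canon is shortened to seven books, and `align`, `CANON`, and `ORDER` are hypothetical names. It only illustrates how a merge-style alignment over two lists that are assumed to be in canonical order drops rows when one list is out of order.

```python
# Shortened canonical book order (assumption: the real reference corpus has every book and verse).
CANON = ["GEN", "EXO", "JDG", "RUT", "MAT", "JUD", "REV"]
ORDER = {book: i for i, book in enumerate(CANON)}


def align(project_rows, reference_rows):
    """Merge-style alignment of two book lists that are assumed to be in canonical order."""
    matched = []
    i = j = 0
    while i < len(project_rows) and j < len(reference_rows):
        p = ORDER[project_rows[i]]
        r = ORDER[reference_rows[j]]
        if p == r:
            # Both sides point at the same book: keep the row.
            matched.append(project_rows[i])
            i += 1
            j += 1
        elif p < r:
            # The project row is "behind" the reference: it is silently skipped.
            i += 1
        else:
            # The reference row has no project text yet: advance the reference.
            j += 1
    return matched


# Judges is mislabeled as JUD, so JUD appears where JDG should be.
broken_project = ["GEN", "JUD", "RUT", "MAT", "JUD", "REV"]
fixed_project = ["GEN", "JDG", "RUT", "MAT", "JUD", "REV"]

print(align(broken_project, CANON))
print(align(fixed_project, CANON))
```

With the mislabeled book, the reference side jumps ahead to JUD, so everything in the project between the fake JUD and REV is treated as "behind" and dropped. With the corrected ID, every book lines up.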

Given all this, I think the best solution is to prevent this scenario by raising an error if the "\id" tag does not match the book ID in the filename. I don't think it's practical to let the mismatch through with only a warning while still extracting the other books correctly, since that would mean completely redesigning the algorithm for aligning rows, and that's probably not warranted just to handle malformed data.

ddaspit commented 6 days ago

@mshannon-sil I agree with your analysis. When there is a mismatch between the file name and the "\id" tag, we don't know which one is correct. We could decide to always use the book code from the file name and ignore the "\id" tag, and that would fix the issue in this case, but in other cases the "\id" tag might be the correct one. The safest thing to do is to throw an exception.
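A minimal sketch of the check discussed above, assuming the caller already knows which book ID the file name implies (how that is derived depends on the project's file naming settings and is not shown). The function name `check_book_id`, its parameters, and the error messages are hypothetical and not part of silnlp.

```python
import re

# Matches an "\id" line and captures the three-character book ID that follows it.
ID_LINE = re.compile(r"^\\id\s+([A-Z0-9]{3})\b", re.MULTILINE)


def check_book_id(usfm_text, expected_book_id, filename="<unknown>"):
    """Raise ValueError if the "\\id" tag is missing or disagrees with the expected book ID."""
    # Drop a leading byte order mark so "^" can match the first line.
    match = ID_LINE.search(usfm_text.lstrip("\ufeff"))
    if match is None:
        raise ValueError(f"{filename}: no \\id tag found")
    actual_book_id = match.group(1)
    if actual_book_id != expected_book_id:
        raise ValueError(
            f"{filename}: \\id tag is {actual_book_id!r}, "
            f"but the file name implies {expected_book_id!r}"
        )
    return actual_book_id


# Usage: the Judges file from this issue would be rejected before any alignment happens.
try:
    check_book_id("\\id JUD - Galego\n\\c 1\n", "JDG", filename="Judges file")
except ValueError as err:
    print(err)
```

Raising instead of guessing matches the reasoning in this thread: when the two sources disagree there is no way to tell which one is right, so the file is rejected and the user fixes it.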