sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Add ISO codes to pair alignment filenames, add stats about trainable and draftable verses #368

Open isaac091 opened 3 months ago

isaac091 commented 3 months ago

It would be useful to add the ISO code. I think the alignments are calculated from an extracted file, and that file's name contains the ISO code. That filename, minus the .txt extension, is what appears in the config.yml file. The code seems to strip that information out of the corpus-stats.csv file.
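As a rough illustration of recovering the ISO code from such a name, here is a minimal sketch. It assumes the extracted file is named `<iso>-<project>.txt` (for example `en-KJV.txt`), which is a guess at the convention and not something confirmed in this thread. The helper `iso_from_corpus_name` is hypothetical and is not an existing silnlp function.

```python
def iso_from_corpus_name(name: str) -> str:
    """Return the ISO code from a corpus name such as 'en-KJV' or 'en-KJV.txt'.

    Assumption: the name has the form '<iso>-<project>', optionally followed
    by a '.txt' extension. This naming scheme is assumed, not confirmed.
    """
    # Strip only a trailing '.txt' so project names containing dots survive.
    stem = name[:-4] if name.endswith(".txt") else name
    # Everything before the first hyphen is treated as the ISO code.
    iso, sep, project = stem.partition("-")
    if not sep or not iso or not project:
        raise ValueError(f"Cannot find an ISO code in corpus name: {name!r}")
    return iso


if __name__ == "__main__":
    print(iso_from_corpus_name("en-KJV.txt"))
    print(iso_from_corpus_name("swh-ONEN"))
```

The same helper works on the entry as written in config.yml (no extension) and on the filename on disk, so the ISO code could be carried through to whatever is written out for each pair.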

Each file contains a total number of "verses-with-text", and it would be good to include that in a 'source: total verses' column.

It would be useful to know the maximum number of verses available for training, i.e. how many verses-with-text in the source also have text in the target.

It would be useful to know how much data is available for drafting, i.e. how many verses-with-text in the source are empty in the target.
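The three numbers described above could be computed along these lines. This is only a sketch: it assumes the source and target are verse-aligned, with one entry per verse in the same order and an empty (or whitespace-only) entry meaning the verse has no text. The names `VerseStats` and `verse_stats` are made up for illustration and do not exist in silnlp.

```python
from typing import NamedTuple, Sequence


class VerseStats(NamedTuple):
    source_total: int  # verses-with-text in the source
    trainable: int     # source has text and target has text
    draftable: int     # source has text and target is empty


def verse_stats(source: Sequence[str], target: Sequence[str]) -> VerseStats:
    """Count verses usable for training and for drafting.

    Assumption: source[i] and target[i] refer to the same verse.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of verses")

    source_total = trainable = draftable = 0
    for src, trg in zip(source, target):
        if not src.strip():
            continue  # verses with no source text count toward neither stat
        source_total += 1
        if trg.strip():
            trainable += 1
        else:
            draftable += 1
    return VerseStats(source_total, trainable, draftable)


if __name__ == "__main__":
    src = ["In the beginning", "", "And God said", "And it was so"]
    trg = ["Au commencement", "Et la terre", "", "  "]
    print(verse_stats(src, trg))
```

By construction, trainable plus draftable equals the source total, which gives a simple sanity check on any stats that get written out.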

Originally posted by @davidbaines in https://github.com/sillsdev/silnlp/issues/317#issuecomment-2065809374

ddaspit commented 1 month ago

@isaac091 Has this been done yet?

isaac091 commented 1 month ago

The ISO codes in filenames bit got added in with the preprocessing efficiency changes I just made, but the additional stats have not been added. The information needed for those stats, the number of verses in the source that are empty in the target and vice versa, is not available with the current get_scripture_parallel_corpus function because it only returns verse pairs where both are non-empty. We will need to decide if we want to modify that function, or if doing so is hijacking the preprocessing process too much and we want to instead create a separate script for calculating the stats.