Open isaac091 opened 3 months ago
@isaac091 Has this been done yet?
The ISO codes in filenames were added along with the preprocessing efficiency changes I just made, but the additional stats have not been added yet. The information needed for those stats (the number of verses in the source that are empty in the target, and vice versa) is not available from the current get_scripture_parallel_corpus function, because it only returns verse pairs where both sides are non-empty. We will need to decide whether to modify that function or, if doing so would hijack the preprocessing process too much, to create a separate script for calculating the stats.
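For the separate-script option, here is a rough sketch of the counting itself. It is not taken from silnlp: `verse_stats` and `stats_for_files` are hypothetical names, and it assumes the extracted files are verse-aligned (one verse per line, same line order in both files, with an empty line meaning the verse has no text). If that assumption does not hold, the counts would be meaningless.

```python
from pathlib import Path
from typing import Dict, Iterable


def verse_stats(src_lines: Iterable[str], trg_lines: Iterable[str]) -> Dict[str, int]:
    """Count verse coverage for two verse-aligned sequences of lines.

    Assumption: line i of the source and line i of the target are the same verse,
    and a blank (or whitespace-only) line means the verse has no text.
    """
    stats = {
        "source_total": 0,  # verses-with-text in the source
        "target_total": 0,  # verses-with-text in the target
        "both": 0,          # text on both sides (upper bound for training)
        "source_only": 0,   # text in source, empty in target (available for drafting)
        "target_only": 0,   # text in target, empty in source
    }
    # zip() stops at the shorter input, so extra trailing lines in one file are ignored.
    for src, trg in zip(src_lines, trg_lines):
        has_src = bool(src.strip())
        has_trg = bool(trg.strip())
        stats["source_total"] += has_src
        stats["target_total"] += has_trg
        if has_src and has_trg:
            stats["both"] += 1
        elif has_src:
            stats["source_only"] += 1
        elif has_trg:
            stats["target_only"] += 1
    return stats


def stats_for_files(src_path: Path, trg_path: Path) -> Dict[str, object]:
    """Hypothetical wrapper: label each side with the filename minus .txt, then count."""
    with src_path.open(encoding="utf-8") as src, trg_path.open(encoding="utf-8") as trg:
        counts = verse_stats(src, trg)
    return {"source": src_path.stem, "target": trg_path.stem, **counts}
```

Keeping the counting in a pure function over lines would let it be called either from a standalone script or from inside preprocessing, whichever we decide on.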
It would be useful to add the ISO code. I think the alignments are calculated from an extracted file, and that file contains the ISO code as part of its filename. That filename, less the .txt extension, is what appears in the config.yml file. The code seems to remove that information from the corpus-stats.csv file.
Each file contains a total number of "verses-with-text", and it would be good to include that in a 'source: total verses' column.
It would be useful to know the maximum number of verses available for training, i.e. how many verses-with-text in the source also have text in the target.
It would be useful to know how much data is available for drafting, i.e. how many verses-with-text in the source are empty in the target.
Originally posted by @davidbaines in https://github.com/sillsdev/silnlp/issues/317#issuecomment-2065809374