sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

When calculating alignments, preserve the ISO code from the filename in the corpus-stats.csv file. #317

Closed: davidbaines closed this issue 5 months ago

davidbaines commented 7 months ago

Currently, files listed in a config.yml file for calculating alignments appear in the corpus-stats.csv file without the "iso-" prefix. When I want to include the best-scoring file in an experiment, I have to refer back to the original config file to find its name. See also: https://github.com/sillsdev/silnlp/issues/128

davidbaines commented 7 months ago

It would also be very helpful for corpus-stats.csv to include a total count of all the verses that contain data in the source text and in the target text. That allows a quick judgement as to whether a text is NT only or also contains the OT.

davidbaines commented 7 months ago

An indication of the script used in each file would also be useful, and it would be very useful to know how that script maps to the scripts available in NLLB. All these additions are steps towards automating our process: these data are needed to create a config file for running an experiment. https://github.com/davidbaines/textinfo/blob/master/python/charfreq_2.py contains code that will count the scripts used in a file.
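As a rough illustration of the script-detection idea, here is a minimal sketch. It is not the code from charfreq_2.py. It assumes that the first word of a character's Unicode name (for example "LATIN" or "CYRILLIC") is an adequate proxy for its script, which is only approximately true, and the `NAME_TO_ISO15924` table is a hypothetical, deliberately tiny mapping to the ISO 15924 codes that appear in NLLB language tags such as `eng_Latn`.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical, incomplete mapping from Unicode-name prefixes to ISO 15924
# script codes (the suffix used in NLLB tags such as "eng_Latn").
NAME_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "DEVANAGARI": "Deva",
    "ARABIC": "Arab",
}


def count_scripts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        # Skip digits, punctuation, whitespace and combining marks.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text: str) -> Optional[str]:
    """Return the ISO 15924 code of the most frequent script, if it is mapped."""
    counts = count_scripts(text)
    if not counts:
        return None
    script = counts.most_common(1)[0][0]
    return NAME_TO_ISO15924.get(script)
```

A file whose dominant script is missing from the table returns `None`, which could serve as a flag for manual review before a config file is generated.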

isaac091 commented 7 months ago

@davidbaines This morning I mentioned that when the extra stats get pulled in for pairs not listed in the config file, the information will be incomplete in corpus-stats.csv. If I disambiguate the project names by adding the ISO codes to the file name, I can get the total project verse count by getting the whole corpus again. However, I still wouldn't be able to get the other missing pieces of information, which are the unfiltered count and alignment score, because the pair isn't tied to an experiment so there's no way to know what filtering comes from specifying books vs giving an alignment threshold.

All that said, would it still be helpful to add the ISO codes to the filenames of the individual stats files? It's a quick fix but I didn't want to add it without asking in case it would be more inconvenient, especially because there's not much to be gained from it in terms of additional statistics.

davidbaines commented 5 months ago

It would be useful to add the ISO code. I think the alignments are calculated from an extracted file whose filename contains the ISO code. That filename, less the .txt extension, is what appears in the config.yml file. The code seems to remove the ISO code when it writes the corpus-stats.csv file.
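To make the request concrete, here is a small sketch of keeping the full filename stem as the label. The filename `en-KJV.txt` and the helper names are hypothetical, and this does not reflect how silnlp currently builds corpus-stats.csv. It only assumes that the ISO code is the part of the stem before the first hyphen.

```python
from pathlib import Path
from typing import Tuple


def corpus_label(path: str) -> str:
    """Return the filename without its extension, keeping any ISO prefix."""
    return Path(path).stem


def split_label(label: str) -> Tuple[str, str]:
    """Split a label into (iso, project) at the first hyphen.

    Returns an empty ISO code when the label has no hyphen.
    """
    iso, sep, project = label.partition("-")
    if not sep:
        return "", label
    return iso, project
```

Writing `corpus_label(...)` to the stats file unchanged would let the value be copied straight back into a config.yml entry.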

Each file contains a total number of "verses-with-text", and it would be good to include that in a 'source: total verses' column.

It would be useful to know the maximum number of verses available for training, i.e. how many verses-with-text in the source also have text in the target.

It would be useful to know how much data is available for drafting, i.e. how many verses-with-text in the source are empty in the target.
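The three counts above can be sketched as follows. This assumes the source and target files are line-aligned, with one verse per line and an empty line where a verse has no text. The `VerseCounts` type and `count_verses` function are hypothetical names, not part of silnlp.

```python
from typing import Iterable, NamedTuple


class VerseCounts(NamedTuple):
    source_total: int  # verses with text in the source
    target_total: int  # verses with text in the target
    trainable: int     # verses with text in both source and target
    draftable: int     # verses with text in the source but empty in the target


def count_verses(src_lines: Iterable[str], trg_lines: Iterable[str]) -> VerseCounts:
    """Count verses over two line-aligned files (assumed: one verse per line)."""
    source_total = target_total = trainable = draftable = 0
    for src, trg in zip(src_lines, trg_lines):
        has_src = bool(src.strip())
        has_trg = bool(trg.strip())
        source_total += has_src
        target_total += has_trg
        trainable += has_src and has_trg
        draftable += has_src and not has_trg
    return VerseCounts(source_total, target_total, trainable, draftable)
```

For example, `count_verses(open(src_path, encoding="utf-8"), open(trg_path, encoding="utf-8"))` would give the values for the proposed columns, where `src_path` and `trg_path` are placeholders for the two extracted files.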

isaac091 commented 5 months ago

Opening a new issue for these suggestions.