Closed rminsil closed 1 month ago
This PR fills in some of the blanks in the stubbed script created in https://github.com/sillsdev/silnlp/pull/527
The main parts:
.txt
Example usage:
$ python -m silnlp.common.normalize_extracts /tmp/extracted/ --output /tmp/normalized --log-level DEBUG
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Starting script
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Output dir set to /tmp/normalized
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Searching files in input dir: '/tmp/extracted'
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Found 8 files to normalize
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.all.txt
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.all.norm.txt
2024-10-05 22:09:16,265 - silnlp.common.normalize_extracts - DEBUG - Found 3942 lines in file
2024-10-05 22:09:16,265 - silnlp.common.normalize_extracts - DEBUG - Writing 3942 sentences to file: /tmp/normalized/swa-2024-08.all.norm.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.all.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.train.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.train.norm.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Found 3152 lines in file
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Writing 3152 sentences to file: /tmp/normalized/swa-2024-08.train.norm.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.train.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.test.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.test.norm.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Found 390 lines in file
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Writing 390 sentences to file: /tmp/normalized/swa-2024-08.test.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.test.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.val.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.val.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Found 400 lines in file
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Writing 400 sentences to file: /tmp/normalized/swa-2024-08.val.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.val.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.train.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.train.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Found 3152 lines in file
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Writing 3152 sentences to file: /tmp/normalized/ngq-2024-08.train.norm.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.train.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.test.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.test.norm.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Found 390 lines in file
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Writing 390 sentences to file: /tmp/normalized/ngq-2024-08.test.norm.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.test.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.all.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.all.norm.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Found 3942 lines in file
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Writing 3942 sentences to file: /tmp/normalized/ngq-2024-08.all.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.all.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.val.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.val.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Found 400 lines in file
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Writing 400 sentences to file: /tmp/normalized/ngq-2024-08.val.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.val.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - INFO - Completed script
The normalization itself isn't implemented yet - that will be the next PR.
So for now this script effectively just copies each extract file from the input directory to the output directory, writing it under a .norm.txt name.
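For reviewers who want a picture of the current behavior, here is a minimal sketch of the copy-through flow that the log output suggests. It is not the code in this PR: the function names (`find_extract_files`, `output_path_for`, `normalize_sentence`, `normalize_extracts`) are hypothetical, and skipping inputs that already end in .norm.txt is an assumption, as is processing files in sorted order.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)

NORM_SUFFIX = ".norm.txt"


def find_extract_files(input_dir: Path) -> List[Path]:
    """Return .txt files in input_dir, skipping ones that already look normalized.

    Skipping *.norm.txt inputs is an assumption of this sketch.
    """
    return sorted(
        p for p in input_dir.glob("*.txt") if not p.name.endswith(NORM_SUFFIX)
    )


def output_path_for(input_path: Path, output_dir: Path) -> Path:
    """Map e.g. swa-2024-08.all.txt to <output_dir>/swa-2024-08.all.norm.txt."""
    # Path.stem drops only the final ".txt", so inner dots in the name survive.
    return output_dir / f"{input_path.stem}{NORM_SUFFIX}"


def normalize_sentence(sentence: str) -> str:
    """Placeholder: returns the sentence unchanged until normalization exists."""
    return sentence


def normalize_extracts(input_dir: Path, output_dir: Path) -> int:
    """Write a (currently unchanged) copy of every extract file; return the file count."""
    output_dir.mkdir(parents=True, exist_ok=True)
    files = find_extract_files(input_dir)
    logger.info("Found %d files to normalize", len(files))
    for input_path in files:
        output_path = output_path_for(input_path, output_dir)
        logger.debug("Processing file %s", input_path)
        # One sentence per line; strip the newline so it can be re-added on write.
        with input_path.open(encoding="utf-8") as f:
            sentences = [line.rstrip("\n") for line in f]
        logger.debug("Found %d lines in file", len(sentences))
        with output_path.open("w", encoding="utf-8", newline="\n") as f:
            for sentence in sentences:
                f.write(normalize_sentence(sentence) + "\n")
        logger.debug("Finished processing %s", input_path)
    return len(files)
```

Keeping the per-sentence step in its own function means a later change could swap in real normalization without touching the file discovery or naming logic.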