sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Issue 494: add file io for normalize script #549

Closed rminsil closed 1 month ago

rminsil commented 1 month ago

This PR fills in some of the blanks in the stubbed script created in https://github.com/sillsdev/silnlp/pull/527

The main parts:

Example usage:

$ python -m silnlp.common.normalize_extracts /tmp/extracted/ --output /tmp/normalized --log-level DEBUG
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Starting script
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Output dir set to /tmp/normalized
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Searching files in input dir: '/tmp/extracted'
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - INFO - Found 8 files to normalize
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.all.txt
2024-10-05 22:09:16,264 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.all.norm.txt
2024-10-05 22:09:16,265 - silnlp.common.normalize_extracts - DEBUG - Found 3942 lines in file
2024-10-05 22:09:16,265 - silnlp.common.normalize_extracts - DEBUG - Writing 3942 sentences to file: /tmp/normalized/swa-2024-08.all.norm.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.all.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.train.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.train.norm.txt
2024-10-05 22:09:16,266 - silnlp.common.normalize_extracts - DEBUG - Found 3152 lines in file
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Writing 3152 sentences to file: /tmp/normalized/swa-2024-08.train.norm.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.train.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.test.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.test.norm.txt
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Found 390 lines in file
2024-10-05 22:09:16,267 - silnlp.common.normalize_extracts - DEBUG - Writing 390 sentences to file: /tmp/normalized/swa-2024-08.test.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.test.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/swa-2024-08.val.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/swa-2024-08.val.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Found 400 lines in file
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Writing 400 sentences to file: /tmp/normalized/swa-2024-08.val.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/swa-2024-08.val.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.train.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.train.norm.txt
2024-10-05 22:09:16,268 - silnlp.common.normalize_extracts - DEBUG - Found 3152 lines in file
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Writing 3152 sentences to file: /tmp/normalized/ngq-2024-08.train.norm.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.train.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.test.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.test.norm.txt
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Found 390 lines in file
2024-10-05 22:09:16,269 - silnlp.common.normalize_extracts - DEBUG - Writing 390 sentences to file: /tmp/normalized/ngq-2024-08.test.norm.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.test.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.all.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.all.norm.txt
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Found 3942 lines in file
2024-10-05 22:09:16,270 - silnlp.common.normalize_extracts - DEBUG - Writing 3942 sentences to file: /tmp/normalized/ngq-2024-08.all.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.all.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Processing file /tmp/extracted/ngq-2024-08.val.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Outputting to /tmp/normalized/ngq-2024-08.val.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Found 400 lines in file
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Writing 400 sentences to file: /tmp/normalized/ngq-2024-08.val.norm.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - DEBUG - Finished processing /tmp/extracted/ngq-2024-08.val.txt
2024-10-05 22:09:16,271 - silnlp.common.normalize_extracts - INFO - Completed script
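
For readers who want to reproduce the invocation above, here is a minimal sketch of how the command-line arguments and log format could be wired up. It is inferred only from the command and log lines shown. The names `parse_args`, `configure_logging`, and `LOG_FORMAT` are hypothetical and may not match the actual script, and the behaviour when `--output` is omitted is an assumption left open here.

```python
import argparse
import logging
from pathlib import Path

# Matches the shape of the log lines above: timestamp - logger name - level - message.
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def parse_args(argv=None):
    """Parse the positional input dir plus --output and --log-level (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Normalize extract files (sketch)")
    parser.add_argument("input_dir", type=Path, help="Directory containing extract files")
    # Assumption: the real script may choose a different default when --output is omitted.
    parser.add_argument("--output", type=Path, default=None, help="Directory for normalized files")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Logging verbosity",
    )
    return parser.parse_args(argv)


def configure_logging(level_name: str) -> None:
    """Configure root logging with the format used in the example output."""
    logging.basicConfig(format=LOG_FORMAT, level=getattr(logging, level_name))
```

Calling `parse_args(["/tmp/extracted/", "--output", "/tmp/normalized", "--log-level", "DEBUG"])` mirrors the example command.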

The normalization itself isn't implemented yet; that will come in the next PR.

For now, the script effectively just copies the extract files from the input directory to the output directory, adding `.norm` before the file extension.
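
The pass-through behaviour described above can be sketched roughly as follows. This is not the code from the PR: `normalized_path`, `normalize`, `process_file`, and `process_dir` are hypothetical names, the `*.txt` glob is an assumption, and only the `.norm.txt` naming is taken from the log output.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def normalized_path(input_path: Path, output_dir: Path) -> Path:
    """Map e.g. swa-2024-08.all.txt to <output_dir>/swa-2024-08.all.norm.txt."""
    return output_dir / f"{input_path.stem}.norm{input_path.suffix}"


def normalize(sentence: str) -> str:
    """Placeholder: normalization is not implemented yet, so return the input unchanged."""
    return sentence


def process_file(input_path: Path, output_dir: Path) -> Path:
    """Read one extract file, pass each line through normalize, and write the result."""
    output_path = normalized_path(input_path, output_dir)
    with input_path.open(encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    logger.debug("Found %d lines in file", len(lines))
    sentences = [normalize(line) for line in lines]
    with output_path.open("w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")
    return output_path


def process_dir(input_dir: Path, output_dir: Path) -> List[Path]:
    """Process every .txt file in input_dir (assumed pattern), skipping prior outputs."""
    output_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in input_dir.glob("*.txt") if not p.name.endswith(".norm.txt"))
    logger.info("Found %d files to normalize", len(files))
    return [process_file(p, output_dir) for p in files]
```

Keeping `normalize` as a separate identity function means the next PR could fill it in without touching the file handling.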

