sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Issue 494: Add stubbed normalization script #527

Open rminsil opened 1 week ago

rminsil commented 1 week ago

This PR adds an initial stubbed-out script to normalize extract files.

It is the first step toward implementing the second half of this comment: https://github.com/sillsdev/silnlp/issues/494#issuecomment-2345328525

Normalization Script Responsibilities

  • Convert multiple consecutive spaces to a single space
  • Correct inconsistent spacing around punctuation
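
As a rough illustration of the two responsibilities above, here is a minimal Python sketch. It is not the PR's code (that logic is still stubbed out); the function names are hypothetical, and the punctuation rule assumes an English-like convention of no space before `, . ; : ! ?` and one space after `, ; ! ?` when a letter follows. Real extract files in other languages may need different rules.

```python
import re

# Hypothetical helpers for illustration only; not part of silnlp.
_MULTI_SPACE = re.compile(r" {2,}")
_SPACE_BEFORE_PUNCT = re.compile(r" +([,.;:!?])")
# [^\W\d_] matches any Unicode letter, so "3,000" is left untouched.
_NO_SPACE_AFTER_PUNCT = re.compile(r"([,;!?])(?=[^\W\d_])")


def collapse_spaces(line: str) -> str:
    """Replace each run of two or more spaces with a single space."""
    return _MULTI_SPACE.sub(" ", line)


def fix_punctuation_spacing(line: str) -> str:
    """Drop spaces before punctuation; add one after it when a letter follows."""
    line = _SPACE_BEFORE_PUNCT.sub(r"\1", line)
    return _NO_SPACE_AFTER_PUNCT.sub(r"\1 ", line)


def normalize_line(line: str) -> str:
    """Apply both normalizations and trim leading/trailing whitespace."""
    return fix_punctuation_spacing(collapse_spaces(line)).strip()
```

For example, `normalize_line("Hello  ,world .  How   are you ?")` returns `"Hello, world. How are you?"`. Keeping the two steps as separate functions would let each be reviewed and tested independently once the real rules are agreed.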

The main parts of the logic are stubbed out, so running the script doesn't do much yet:

$ python -m silnlp.common.normalize_extracts /tmp/xri --output hackery/normalized
2024-09-19 17:08:17,470 - silnlp.common.normalize_extracts - INFO - Starting script
2024-09-19 17:08:17,470 - silnlp.common.normalize_extracts - INFO - Output dir set to hackery/normalized
2024-09-19 17:08:17,470 - silnlp.common.normalize_extracts - INFO - Found 0 files to normalize
2024-09-19 17:08:17,470 - silnlp.common.normalize_extracts - INFO - Completed script
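
For reviewers who want to reason about the interface without opening the diff, the run above is consistent with a stub shaped roughly like the following. This is a hypothetical sketch, not the code in this PR: the logger name, the `find_files` helper, and making `--output` required are all assumptions made for illustration.

```python
import argparse
import logging
from pathlib import Path
from typing import List, Optional

# Assumed logger name; the real script logs as silnlp.common.normalize_extracts.
LOGGER = logging.getLogger("normalize_extracts_sketch")


def find_files(input_dir: Path) -> List[Path]:
    """Stub: a real version would discover extract files under input_dir."""
    return []


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Normalize extract files (sketch)")
    parser.add_argument("input_dir", type=Path, help="directory containing extract files")
    # Whether --output is required in the real script is an assumption here.
    parser.add_argument("--output", type=Path, required=True, help="directory for normalized files")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments, log each step, and return the number of files found."""
    args = build_parser().parse_args(argv)
    logging.basicConfig(
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        level=logging.INFO,
    )
    LOGGER.info("Starting script")
    LOGGER.info("Output dir set to %s", args.output)
    files = find_files(args.input_dir)
    LOGGER.info("Found %d files to normalize", len(files))
    LOGGER.info("Completed script")
    return len(files)
```

Calling `main(["/tmp/xri", "--output", "hackery/normalized"])` logs the same four steps as the run above and returns the file count, which stays at zero until `find_files` is filled in.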

The aim of the PR is to get verification from Damien and Michael that the script interface and general approach make sense, as there haven't been any discussions on those specific details. Once that is sorted, I'll fill in the other parts in follow-up PRs, hence the stubbing.

