sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.
Other
35 stars 3 forks source link

Issue 473: change how output directory is generated for extract_xri script #507

Closed rminsil closed 2 months ago

rminsil commented 2 months ago

This PR addresses one point raised in the code review feedback from Damien from this comment:

https://github.com/sillsdev/silnlp/pull/491#pullrequestreview-2261754354

I would make the output directory an optional command-line argument. By default, you should extract files to SIL_NLP_ENV.mt_corpora_dir. This should point to the MT/corpora folder in the S3 bucket, which is where we would store this kind of corpus.

I wasn't clear on the directory structure to use within the corpora dir, so I kept the existing logic of generating a unique folder based on the cli inputs + timestamp.

Note that this PR is built off the back of https://github.com/sillsdev/silnlp/pull/506. When that one is merged I'll rebase this one and have it target master.

Once this PR is merged, all code review comments from #491 are addressed and balance is returned to the force.


This change is Reviewable

rminsil commented 2 months ago

SIL_NLP_ENV.mt_corpora_dir is already a Path, so you should be able to do:

output_dir = SIL_NLP_ENV.mt_corpora_dir / unique_dir

Thanks for the tip @ddaspit, there's some newfangled things added to python since I last used it.

I've updated my PR.