When calculating --stats with the preprocess function don't recalculate existing results and include all results in the final scores.

sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

When calculating --stats with the preprocess function don't recalculate existing results and include all results in the final scores. #320

Closed davidbaines closed 8 months ago

davidbaines commented 9 months ago

An alignment config can contain several source and target texts, and the alignments can take hours. Alignment runs are often interrupted partway through.

Currently, rerunning the same command starts from the beginning, even if many of the alignments have already been calculated. The default behaviour should be to check whether each output file has already been created and, if it has, skip recalculating it.

We could add a --recalculate option to override this default behaviour.
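To make the proposal concrete, here is a minimal sketch of the skip-if-present check. It is not taken from the silnlp code: `run_missing_alignments`, its `align` callback and the `recalculate` argument are hypothetical names, and it assumes each alignment writes exactly one output file.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def run_missing_alignments(
    output_paths: Iterable[Path],
    align: Callable[[Path], None],
    recalculate: bool = False,
) -> List[Path]:
    """Run `align` only for outputs that are missing; return the paths that were computed.

    Hypothetical helper: `align` stands in for whatever produces one alignment file.
    """
    computed: List[Path] = []
    for path in output_paths:
        if path.is_file() and not recalculate:
            # The result is already on disk from an earlier run, so skip it.
            continue
        align(path)
        computed.append(path)
    return computed
```

One caveat: an interrupted run could leave a half-written file behind, which this check would treat as finished. Writing to a temporary name and renaming on completion would avoid that.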

It would also be useful if the script could search for all the results files in the folder and include them in the final scores.csv file, even if the config.yml file does not list them. These two changes would make calculating alignment scores much more flexible and efficient.
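A rough sketch of the second idea, gathering every results file in a folder into one scores.csv. The file pattern `*_results.csv`, the `source_file` column and the function name `collect_scores` are assumptions for illustration only; the real results files may be named and laid out differently.

```python
import csv
from pathlib import Path
from typing import Dict, List


def collect_scores(folder: Path, pattern: str = "*_results.csv", out_name: str = "scores.csv") -> Path:
    """Merge all CSV files matching `pattern` in `folder` into one CSV (hypothetical layout)."""
    out_path = folder / out_name
    rows: List[Dict[str, str]] = []
    fieldnames: List[str] = ["source_file"]
    for path in sorted(folder.glob(pattern)):
        if path == out_path:
            # Never read the combined file back into itself.
            continue
        with path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                rows.append({"source_file": path.name, **row})
                # Keep the union of columns, in first-seen order.
                for key in row:
                    if key not in fieldnames:
                        fieldnames.append(key)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Because the files are found by globbing the folder rather than by reading config.yml, results from earlier or differently configured runs end up in the combined file as well.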

davidbaines commented 8 months ago

Hi Isaac,

I think we might need to modify the main function in preprocess.py so that it accepts a --force-align option from the user and passes it to the preprocess function as force_align. Something like this:

import argparse

# get_git_revision_hash, SIL_NLP_ENV and load_config are already imported in preprocess.py.


def main() -> None:
    parser = argparse.ArgumentParser(description="Preprocesses the parallel corpus for an NMT model")
    parser.add_argument("experiment", help="Experiment name")
    parser.add_argument("--stats", default=False, action="store_true", help="Output corpus statistics")
    parser.add_argument(
        "--force-align",
        default=False,
        action="store_true",
        help="Rerun alignments even if alignment output already exists",
    )

    args = parser.parse_args()

    get_git_revision_hash()

    exp_name = args.experiment
    SIL_NLP_ENV.copy_experiment_from_bucket(exp_name)
    config = load_config(exp_name)

    config.set_seed()
    # Pass the new flag through to preprocess.
    config.preprocess(args.stats, force_align=args.force_align)
    SIL_NLP_ENV.copy_experiment_to_bucket(exp_name)

The test_preprocess.py file also calls this function and will need to be updated.
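For anyone updating callers, this small standalone sketch shows how argparse exposes the flag: `--force-align` becomes the attribute `force_align`, and it is False unless the flag is given. The `build_parser` helper is hypothetical and only mirrors the arguments proposed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper that mirrors the arguments in the proposed main()."""
    parser = argparse.ArgumentParser(description="Preprocesses the parallel corpus for an NMT model")
    parser.add_argument("experiment", help="Experiment name")
    parser.add_argument("--stats", default=False, action="store_true", help="Output corpus statistics")
    parser.add_argument("--force-align", default=False, action="store_true", help="Rerun alignments")
    return parser


# The dash in the option name becomes an underscore in the attribute name.
default_args = build_parser().parse_args(["my_experiment", "--stats"])
forced_args = build_parser().parse_args(["my_experiment", "--stats", "--force-align"])
print(default_args.force_align, forced_args.force_align)
```

If force_align is given a default of False in the preprocess signature, existing call sites that do not pass it should keep working unchanged.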

Enkidu93 commented 8 months ago

@davidbaines I noticed this issue too, but I believe Isaac added that in this PR https://github.com/sillsdev/silnlp/pull/328. Have you pulled?

davidbaines commented 8 months ago

Thanks Eli,

I thought that I had. I'm probably on a different branch or something.

All the best, David
