Closed by davidbaines 8 months ago
Hi Isaac,
I think we need to modify the main function in preprocess.py so that it accepts a force-align option from the user and passes it on to the preprocess function as force_align. Something like this:
    def main() -> None:
        parser = argparse.ArgumentParser(description="Preprocesses the parallel corpus for an NMT model")
        parser.add_argument("experiment", help="Experiment name")
        parser.add_argument("--stats", default=False, action="store_true", help="Output corpus statistics")
        parser.add_argument("--force-align", default=False, action="store_true", help="Rerun alignments even if existing alignments already exist.")
        args = parser.parse_args()

        get_git_revision_hash()

        exp_name = args.experiment
        SIL_NLP_ENV.copy_experiment_from_bucket(exp_name)
        config = load_config(exp_name)
        config.set_seed()
        config.preprocess(args.stats, force_align=args.force_align)
        SIL_NLP_ENV.copy_experiment_to_bucket(exp_name)
The test_preprocess.py file also calls this function and will need to be updated.
@davidbaines I noticed this issue too, but I believe Isaac added that in this PR https://github.com/sillsdev/silnlp/pull/328. Have you pulled?
Thanks Eli,
I thought that I had. I'm probably on a different branch or something.
All the best, David
An alignment config can contain several source and target texts, and the alignments can take hours. Alignment runs are often interrupted partway through.
Currently, running the same command again starts from the beginning, even if many of the alignments have already been calculated. The default behaviour should be to check whether each alignment file has already been created and, if so, not to recalculate it.
We could add a --recalculate option to override this default behaviour.
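A rough sketch of the skip-if-exists behaviour, to make the idea concrete. This is not the existing silnlp code: `run_alignments`, `align_fn`, and the `<name>.txt` output naming are all hypothetical, and the only assumption is that each alignment writes exactly one output file. Writing to a temporary file and renaming it at the end means an interrupted run should not leave a partial file that later looks complete.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def run_alignments(
    names: Iterable[str],
    output_dir: Path,
    align_fn: Callable[[str, Path], None],  # hypothetical: writes one alignment to the given path
    recalculate: bool = False,
) -> List[Path]:
    """Run align_fn for each name, skipping outputs that already exist.

    Returns the list of output files that were actually (re)computed.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    computed: List[Path] = []
    for name in names:
        out_path = output_dir / f"{name}.txt"  # hypothetical naming scheme
        already_done = out_path.is_file() and out_path.stat().st_size > 0
        if already_done and not recalculate:
            continue  # default: reuse the existing result
        # Write to a temporary file first so that an interrupted run
        # never leaves a half-written file under the final name.
        tmp_path = out_path.with_name(out_path.name + ".tmp")
        align_fn(name, tmp_path)
        tmp_path.replace(out_path)
        computed.append(out_path)
    return computed
```

With something like this, rerunning the same command after an interruption would only compute the missing alignments, and passing `recalculate=True` (wired to a `--recalculate` flag) would redo everything.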
It would also be useful if the script could search for all the results files in the folder and include them in the final scores.csv file, even if they are not expected according to the contents of the config.yml file. Together, these two changes would make calculating alignment scores much more flexible and efficient.
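A possible shape for the results collection, again only as a sketch. The per-alignment file format here is an assumption made for illustration: each results file is taken to be named `<name>.score.txt` and to contain a single number. `collect_scores` and the `in_config` column are hypothetical names; the real results files may look quite different.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Tuple

SCORE_SUFFIX = ".score.txt"  # hypothetical results-file naming


def collect_scores(results_dir: Path, expected: Iterable[str]) -> List[Tuple[str, float, bool]]:
    """Gather every results file in results_dir into scores.csv.

    Files are included whether or not their name is listed in the config;
    the in_config column records which ones were expected.
    """
    expected_names = set(expected)
    rows: List[Tuple[str, float, bool]] = []
    for path in sorted(results_dir.glob(f"*{SCORE_SUFFIX}")):
        name = path.name[: -len(SCORE_SUFFIX)]
        score = float(path.read_text().strip())  # assumes one number per file
        rows.append((name, score, name in expected_names))

    with open(results_dir / "scores.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "score", "in_config"])
        writer.writerows(rows)
    return rows
```

The `in_config` flag keeps the extra results visible in scores.csv without hiding the fact that they were not requested by the current config.yml.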