sillsdev / machine.py

Machine is a natural language processing library for Python that is focused on providing tools for processing resource-poor languages.
MIT License

Alignment Job #114

Closed johnml1135 closed 2 weeks ago

johnml1135 commented 1 month ago

Add an SMT alignment job. This lays the groundwork for https://github.com/sillsdev/serval/issues/410.


johnml1135 commented 4 weeks ago

machine/jobs/build_word_alignment_model.py line 7 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
This should be a relative import.

Done.

johnml1135 commented 4 weeks ago

machine/jobs/engine_build_job.py line 13 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
I don't know how much this class buys us, since it doesn't really have any logic in it. It just provides a skeleton. I am also not a big fan of passing variables between methods using class fields. It makes the dependencies between the methods less explicit and more error-prone. If we do use this class, we should provide different implementations for MT and word alignment.

Renamed to TranslationEngineBuildJob - and removed it from WordAlignmentBuildJob.

johnml1135 commented 4 weeks ago

machine/jobs/engine_build_job.py line 47 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
All of these "protected" methods should be prefixed with an underscore.

Done.

johnml1135 commented 4 weeks ago

machine/jobs/nmt_engine_build_job.py line 113 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
Why are you ignoring the type checking here?

Changed the typing check for write - to just look for an object.

johnml1135 commented 4 weeks ago

machine/jobs/settings.yaml line 30 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
We will probably want different settings for the SMT engine and the word alignment models. We should add a new section, maybe `thot_align`.

Done.

johnml1135 commented 4 weeks ago

machine/jobs/shared_file_service.py line 41 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
I would define separate classes for MT and word alignment. You can still define a base class if there is some shared code.

Ugh - took a while.
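
For readers following the thread, the suggested split could look roughly like the sketch below. It is only an illustration of "a base class for shared code plus one subclass per job type"; the class names, the `builds/` folder layout, and the file names are all hypothetical and are not taken from the PR.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class FileServiceBase(ABC):
    """Hypothetical base class holding the code both services share."""

    def __init__(self, build_id: str) -> None:
        self._build_id = build_id

    def build_path(self, name: str) -> Path:
        # Shared logic: every artifact lives under the build's own folder.
        return Path("builds") / self._build_id / name

    @abstractmethod
    def output_names(self) -> List[str]:
        """Names of the files this kind of build produces."""


class TranslationFiles(FileServiceBase):
    # MT-specific behavior only; the file name is invented.
    def output_names(self) -> List[str]:
        return ["pretranslations.json"]


class WordAlignmentFiles(FileServiceBase):
    # Word-alignment-specific behavior only; the file name is invented.
    def output_names(self) -> List[str]:
        return ["word_alignments.json"]
```

Each job then depends only on the subclass it needs, while path handling stays in one place.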

johnml1135 commented 4 weeks ago

machine/jobs/word_alignment_build_job.py line 84 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
Why is type checking being ignored here?

There was a typo - I found it and it resolved the core issue.

johnml1135 commented 4 weeks ago

machine/jobs/smt_model_factory.py line 14 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
The `SmtModelFactory` class is intended to be decoupled from any specific SMT implementation, such as Thot. Also, you should define a separate word alignment model factory.

Done.
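
As a rough illustration of the decoupling being asked for, the job code can depend on an abstract factory while the Thot-specific class lives elsewhere. The method `describe`, the helper `run_job`, and the `model_type` argument below are invented for this sketch and are not the library's actual interface.

```python
from abc import ABC, abstractmethod


class WordAlignmentModelFactory(ABC):
    """Implementation-agnostic interface (the method is hypothetical)."""

    @abstractmethod
    def describe(self) -> str:
        """Return a short label for the model this factory builds."""


class ThotWordAlignmentModelFactory(WordAlignmentModelFactory):
    """Thot-specific details stay inside the concrete subclass."""

    def __init__(self, model_type: str) -> None:
        self._model_type = model_type

    def describe(self) -> str:
        return f"thot:{self._model_type}"


def run_job(factory: WordAlignmentModelFactory) -> str:
    # The job only sees the abstract type, so another backend could be
    # swapped in without touching this function.
    return f"training with {factory.describe()}"
```

A call such as `run_job(ThotWordAlignmentModelFactory("hmm"))` never mentions Thot inside the job itself.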

codecov-commenter commented 4 weeks ago

Codecov Report

Attention: Patch coverage is 80.13356% with 119 lines in your changes missing coverage. Please review.

Project coverage is 88.00%. Comparing base (3912c6a) to head (9567830).

| Files | Patch % | Lines |
|---|---|---|
| machine/jobs/nmt_engine_build_job.py | 52.27% | 21 Missing :warning: |
| machine/jobs/shared_file_service_base.py | 64.28% | 20 Missing :warning: |
| machine/jobs/translation_file_service.py | 68.29% | 13 Missing :warning: |
| machine/tokenization/tokenizer_factory.py | 75.55% | 11 Missing :warning: |
| machine/jobs/word_alignment_file_service.py | 61.53% | 10 Missing :warning: |
| machine/jobs/clearml_shared_file_service.py | 30.76% | 9 Missing :warning: |
| machine/jobs/shared_file_service_factory.py | 60.00% | 6 Missing :warning: |
| ...ranslation/thot/thot_word_alignment_model_utils.py | 50.00% | 6 Missing :warning: |
| machine/jobs/smt_model_factory.py | 58.33% | 5 Missing :warning: |
| machine/jobs/word_alignment_model_factory.py | 78.26% | 5 Missing :warning: |

... and 5 more
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #114      +/-   ##
==========================================
+ Coverage   87.81%   88.00%   +0.18%
==========================================
  Files         249      259      +10
  Lines       15082    15445     +363
==========================================
+ Hits        13245    13593     +348
- Misses       1837     1852      +15
```


johnml1135 commented 3 weeks ago

machine/jobs/build_nmt_engine.py line 61 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
It would be better to use the enum value for the type instead of the string.

Done.
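
A small sketch of why the enum is the safer choice: a misspelled string silently falls through, whereas a misspelled enum member fails immediately. `StorageKind` and both functions are hypothetical stand-ins, not the actual types used in the job scripts.

```python
from enum import Enum, auto


class StorageKind(Enum):
    """Hypothetical stand-in for the job's file-service type enum."""

    LOCAL = auto()
    CLEARML = auto()


def describe_by_name(kind: str) -> str:
    # String version: a misspelled value is not caught anywhere.
    if kind == "local":
        return "local files"
    if kind == "clearml":
        return "ClearML storage"
    return "unknown"


def describe(kind: StorageKind) -> str:
    # Enum version: only declared members can be referenced by callers.
    if kind is StorageKind.LOCAL:
        return "local files"
    return "ClearML storage"
```

With the string version, `describe_by_name("locl")` quietly returns `"unknown"`; with the enum, writing `StorageKind.LOCL` raises `AttributeError` at the call site, and a type checker can flag it before the code runs.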

johnml1135 commented 3 weeks ago

machine/jobs/build_word_alignment_model.py line 58 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
It would be better to use the enum value for the type instead of the string.

Done.

johnml1135 commented 3 weeks ago

machine/jobs/translation_engine_build_job.py line 51 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
This is hacky. As I stated before, there are issues with passing parameters between methods using class fields. You can create these objects at the beginning of the `run` method and pass them as parameters to the appropriate methods.

I moved them to properties. That appears more Pythonic and saves having to pass the source, target, and parallel corpora around.

johnml1135 commented 3 weeks ago

machine/jobs/translation_file_service.py line 7 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
You missed a couple of relative imports.

Found and fixed them all.

johnml1135 commented 3 weeks ago

machine/jobs/translation_file_service.py line 25 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
Do we need to support a string value? Just the enum should be sufficient.

Changed to constants.

johnml1135 commented 3 weeks ago

machine/jobs/word_alignment_build_job.py line 55 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
See my comment in the `TranslationEngineBuildJob` class.

Also updated.

johnml1135 commented 3 weeks ago

machine/jobs/thot/thot_word_alignment_model_factory.py line 18 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
We will probably want a different new model template for word alignment models. The SMT template contains many files that are specific to SMT models. Should I provide you with a new alignment model template?

Yes, please. I wouldn't know what was needed (unless I spent many hours).

johnml1135 commented 3 weeks ago

machine/jobs/thot/thot_word_alignment_model_factory.py line 42 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
You should use the model type from the `thot_align` section.

Done - and changed it over for the tokenizer as well.

johnml1135 commented 3 weeks ago

I believe the last thing is the new word alignment template.

johnml1135 commented 3 weeks ago

machine/tokenization/tokenizer_factory.py line 28 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
These functions should be exported from the `tokenization` package.

Done.

johnml1135 commented 3 weeks ago

machine/jobs/build_word_alignment_model.py line 88 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
This return value doesn't seem to be used anywhere. Are you returning this intentionally?

It's still a work in progress - I'll make a few more changes.

johnml1135 commented 3 weeks ago

machine/jobs/translation_file_service.py line 25 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
The constants are a good change. I was actually referring to the `type` parameter. I don't think we need to accept a `str`.

I believe we do need to, since the values are defined in SETTINGS.

johnml1135 commented 3 weeks ago

machine/jobs/translation_engine_build_job.py line 51 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
The `create_source_corpus` and `create_target_corpus` methods actually download files from the S3 bucket. It would be unexpected behavior and a code smell for a property to perform long-running processing. Also, you shouldn't reference fields by their name, since this can easily result in a bug if the field name is ever changed. The easiest way forward is to remove this class and create the corpora in the `run` method of the `SmtEngineBuildJob` and `NmtEngineBuildJob` classes. This class adds complexity without adding much value. The other option is to create the corpora at the beginning of the `run` method and pass the variables to the appropriate methods.

Makes sense - updated.
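
The second option from the review (create the corpora at the start of `run` and pass them along explicitly) might be sketched as follows. `SketchBuildJob` and its helper methods are hypothetical; the download is simulated with an in-memory list, so this is not how the actual `SmtEngineBuildJob` or `NmtEngineBuildJob` loads data.

```python
from typing import List, Tuple

Corpus = List[str]


class SketchBuildJob:
    """Hypothetical job used only to illustrate explicit parameter passing."""

    def run(self) -> int:
        # Do the slow work once, up front, where a reader expects it,
        # instead of hiding it behind a property or a class field.
        source = self._download_corpus("src")
        target = self._download_corpus("trg")
        parallel = list(zip(source, target))
        # Each step declares exactly what it needs in its signature.
        return self._train(parallel)

    def _download_corpus(self, side: str) -> Corpus:
        # Placeholder for a long-running download from shared storage.
        return [f"{side} sentence {i}" for i in range(2)]

    def _train(self, parallel: List[Tuple[str, str]]) -> int:
        # Returns the number of sentence pairs "trained" on.
        return len(parallel)
```

Because `_train` receives `parallel` as an argument, renaming or removing a field cannot silently break it, and the order of operations is visible in `run`.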

johnml1135 commented 3 weeks ago

machine/jobs/build_word_alignment_model.py line 88 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…
It's still a work in progress - I'll make a few more changes.

I'm slimming it down a bit.

johnml1135 commented 2 weeks ago

machine/jobs/settings.yaml line 31 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
I know I told you to create a `thot_align` section, but I'm still not happy with the way that the settings are structured. I don't think the section names are clear enough. What if we change the `thot` and `huggingface` sections to `thot_mt` and `huggingface_mt`? That would clearly differentiate the MT configs from the alignment configs. Or we could have separate `mt` and `align` sections under the `thot` section.

Done.
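
The two layouts proposed above can be compared side by side. They are written here as Python dictionaries rather than YAML, the `model_type` key and its `"hmm"` value are placeholders, and the thread does not say which layout was adopted.

```python
# Option 1: flat section names with an explicit "_mt" / "_align" suffix.
flat_settings = {
    "thot_mt": {"model_type": "hmm"},      # placeholder value
    "thot_align": {"model_type": "hmm"},   # placeholder value
    "huggingface_mt": {},
}

# Option 2: one section per toolkit, with "mt" and "align" nested inside.
nested_settings = {
    "thot": {
        "mt": {"model_type": "hmm"},
        "align": {"model_type": "hmm"},
    },
    "huggingface": {"mt": {}},
}

# The same value is reachable in both layouts; only the lookup path differs.
flat_value = flat_settings["thot_align"]["model_type"]
nested_value = nested_settings["thot"]["align"]["model_type"]
```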

johnml1135 commented 2 weeks ago

machine/jobs/translation_file_service.py line 25 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
It looks like all calls to the `TranslationFileService` constructor pass a `SharedFileServiceType` enum value. A `str` value is never passed.

Yes - you are right. It has to do with what calls it, not the underlying settings. Updated for both.
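
One way to reconcile "the value is a string in the settings" with "the constructor takes only the enum" is to convert once, at the call site. The names `FileServiceKind`, `FileService`, and the `shared_file_service` key below are hypothetical and stand in for the real classes and settings.

```python
from enum import Enum


class FileServiceKind(Enum):
    """Hypothetical enum; the values mirror what a settings file might hold."""

    LOCAL = "local"
    CLEARML = "clearml"


class FileService:
    def __init__(self, kind: FileServiceKind) -> None:
        # Accept only the enum, so every caller goes through one conversion.
        if not isinstance(kind, FileServiceKind):
            raise TypeError(f"expected FileServiceKind, got {type(kind).__name__}")
        self.kind = kind


# The settings hold a plain string (hypothetical key and value) ...
settings = {"shared_file_service": "clearml"}

# ... and the caller converts it to the enum exactly once, by value.
service = FileService(FileServiceKind(settings["shared_file_service"]))
```

An unknown string in the settings then fails at the conversion with a `ValueError`, rather than somewhere deep inside the service.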

johnml1135 commented 2 weeks ago

machine/jobs/thot/thot_word_alignment_model_factory.py line 18 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
Now that I think about it, I don't think we need a new model template. You can just remove the `init` method altogether.

Done.