sillsdev / machine

Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages.
MIT License

Smt on clearml #200

Closed johnml1135 closed 4 months ago

johnml1135 commented 5 months ago

There is still a bug or two to work out, but it should be good enough to review. At this point there will likely be more churn from review than from bugs.



johnml1135 commented 5 months ago

src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, johnml1135 (John Lambert) wrote…
I changed it back to .zip. Dotnet handles .zip more easily than .tar.gz, and Python doesn't care.

Correction: I brought it back to gzip. I just needed to use a new dotnet library to keep the code a sensible size.

johnml1135 commented 4 months ago

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
I don't think that deleting the Machine database is an option. It would break all of the existing Serval engines. I believe that this is the only breaking DB change. I actually think that we might be able to get by without this property. It is only used when starting a build job. We should be able to pass in the engine type when we call `IBuildJobService.StartBuildJobAsync`. The one place where it is tricky is when we start the postprocess job from the `ClearMLMonitorService`. If we have some way to differentiate between the engine types from the ClearML task metadata, then it should be doable.

The only concern would be historic jobs that died mid-run but, because of the Hangfire restart issue, are just languishing in an incomplete state. When the Hangfire issue is fixed, those jobs will start back up and then fail because they don't have a translation engine type. That is why I am considering just killing the jobs that never completed. As for differentiating between engine types, I don't know if we can do it reliably. We could update the database manually by taking the engine type from Serval and putting it into the MongoDB database, but that would require an update script or a database conversion. Do we want to go down that route? Again, I recommend just deleting all jobs from Machine.

johnml1135 commented 4 months ago

src/SIL.Machine.AspNetCore/Services/ThotSmtModelFactory.cs line 72 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
You can use the `GZipStream` and `TarFile` classes to extract the `.tar.gz` file. [Here](https://www.nikouusitalo.com/blog/how-to-natively-read-tgz-files-with-the-new-c-tarreader-class/) is an example that uses `TarReader`, but it should be easy to adapt to `TarFile`.

done (by you).
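
For reference, the gzip-then-tar layering discussed above can be sketched from the Python side. This is only an illustration, not the code in this PR: the function names `pack_model` and `unpack_model` and the file names in the usage note are hypothetical. The reader opens the gzip layer first and then hands the decompressed stream to the tar reader, which mirrors wrapping a `GZipStream` and passing it to the tar classes in .NET.

```python
import gzip
import tarfile
from pathlib import Path


def pack_model(model_dir: Path, archive_path: Path) -> None:
    """Write every file under model_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            # Store paths relative to model_dir so the archive has no absolute paths.
            arcname = path.relative_to(model_dir).as_posix()
            tar.add(path, arcname=arcname, recursive=False)


def unpack_model(archive_path: Path, dest_dir: Path) -> list[str]:
    """Extract a .tar.gz archive in two explicit layers: gzip first, then tar."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with gzip.open(archive_path, "rb") as gz_stream:
        # mode="r:" means "plain tar, no compression"; gzip was already peeled off.
        with tarfile.open(fileobj=gz_stream, mode="r:") as tar:
            names = [member.name for member in tar.getmembers() if member.isfile()]
            tar.extractall(dest_dir)
    return names
```

A round trip would look like `pack_model(Path("model"), Path("model.tar.gz"))` followed by `unpack_model(Path("model.tar.gz"), Path("restored"))`. Only extract archives from a trusted source this way, since `extractall` writes whatever paths the archive contains.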

johnml1135 commented 4 months ago

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…
We should meet to come up with a plan.
  1. Write a Python script to delete "currentBuild" from every translation engine and remove the machine_jobs Hangfire database.
  2. Test on Internal QA: using 1.4.5, (1) run a set of E2E tests, (2) spin down Machine, (3) run the Python script, (4) spin up 1.5.0, (5) rebuild an existing SMT and NMT engine, (6) run all E2E tests again.
  3. For production: (1) make sure there are no running builds, (2) spin down Serval, (3) run the script, (4) spin up Serval 1.5.
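
A minimal sketch of what step 1 could look like, assuming the client is a `pymongo.MongoClient`. The database name `machine` and collection name `translation_engines` are assumptions for illustration; only `currentBuild` and `machine_jobs` come from the plan above. The function takes the client as a parameter so it can be exercised against a test instance first.

```python
def reset_machine_state(
    client,
    machine_db: str = "machine",  # assumed database name
    engines_collection: str = "translation_engines",  # assumed collection name
    jobs_db: str = "machine_jobs",
) -> int:
    """Clear currentBuild on every engine, then drop the Hangfire jobs database.

    Returns the number of engine documents that were modified.
    """
    engines = client[machine_db][engines_collection]
    # Only touch documents that actually carry the field.
    result = engines.update_many(
        {"currentBuild": {"$exists": True}},
        {"$unset": {"currentBuild": ""}},
    )
    # Dropping the database discards all queued and stalled Hangfire jobs.
    client.drop_database(jobs_db)
    return result.modified_count
```

With pymongo installed, this would be called as `reset_machine_state(MongoClient(connection_string))`, where `connection_string` is a placeholder for the real deployment URI. The script is destructive, so it should run only after Machine has been spun down, as in steps 2 and 3.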
johnml1135 commented 4 months ago

src/SIL.Machine.AspNetCore/Models/TranslationEngine.cs line 8 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…
  1. Write a Python script to delete "currentBuild" from every translation engine and remove the machine_jobs Hangfire database.
  2. Test on Internal QA: using 1.4.5, (1) run a set of E2E tests, (2) spin down Machine, (3) run the Python script, (4) spin up 1.5.0, (5) rebuild an existing SMT and NMT engine, (6) run all E2E tests again.
  3. For production: (1) make sure there are no running builds, (2) spin down Serval, (3) run the script, (4) spin up Serval 1.5.

https://github.com/sillsdev/serval/issues/407

johnml1135 commented 4 months ago

Reverted.