Closed by johnml1135, 1 year ago
There will be enough other custom configurations that we will really need a way to add a specific configuration file to a job when building. We could do this by adding a new field, `configurationFile`, to the build request that references a separately uploaded file. This is likely the best approach, with documentation giving examples of the parameters that can be adjusted and what they do. On top of this, automation could be built to create the proper config for each situation.
Scope of configurability:
Options for controlling the build:

| corpora | sourceFiles | targetFiles | pretranslate | saveModel | Result |
|---|---|---|---|---|---|
| false | x | x | x | false | error: need to pretranslate sourceFiles for batch inference or to save the model for live inference |
| true | false | x | x | false | error: need to pretranslate sourceFiles for batch inference or to save the model for live inference |
| true | true | x | false | false | error: need to pretranslate sourceFiles for batch inference or to save the model for live inference |
| x | x | x | x | true | model saved: the original (or trained, if there is training) model is saved for future use (inference or fine-tuning) |
| true | true | x | true | x | pretranslate: the original (or trained, if there is training) model will pretranslate the sourceFiles |
| true | true | true | x | x | train: the model will be fine-tuned on the corpora |
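The validity rules in the table above amount to one check: a build must either pretranslate the source files or save the model. As a sketch (the function and parameter names are illustrative, mirroring the table columns, not actual Machine.py code):

```python
def plan_build(corpora, source_files, target_files, pretranslate, save_model):
    """Return the list of actions a build would perform, per the table,
    or raise if the combination of options does nothing useful."""
    will_pretranslate = corpora and source_files and pretranslate
    if not will_pretranslate and not save_model:
        raise ValueError(
            "need to pretranslate sourceFiles for batch inference "
            "or to save the model for live inference"
        )
    actions = []
    if corpora and source_files and target_files:
        actions.append("train")  # fine-tune on the corpora
    if will_pretranslate:
        actions.append("pretranslate")
    if save_model:
        actions.append("save_model")
    return actions
```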
Further build options include:
Note: saved models can be used for a future build, future batch inferencing or future live inferencing.
This also accommodates:
For machine.py, I believe the updates include:
```yaml
do_train: false
```

For Serval, the changes would be:
Serval will pass the BuildConfigID directly to Machine. Machine will check whether the BuildConfigID is a fileId; if so, it will put the file on the S3 bucket and name it config.yaml. If not, Machine will pass the BuildConfigId to Machine.py as the "BuildConfig" parameter. Machine.py will parse and interpret the BuildConfig as:
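For illustration, an uploaded config.yaml under this scheme might look like the following. The keys are drawn from parameters mentioned elsewhere in this thread and are illustrative, not a committed interface:

```yaml
# illustrative only; the actual keys depend on the engine implementation
do_train: false
model_type: huggingface
max_steps: 20000
```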
I don't think we should add a configuration file. We should just add a configuration property to the engine that can be used for custom configuration.
The issue there is that (if I am understanding it properly) we would need to update Serval every time the configuration changes. Suppose we also end up supporting some form of ChatGPT, where the "configuration" is a starting prompt and a few parameters. The format proposed here (a file of arbitrary format that is checked and processed by the engine code and documented well) would, I believe, be the most flexible and usable while requiring minimal changes. We could update Machine.py with a new flag in a day, and then have that flag set from a custom file, without having to touch one line of dotnet code.
The configuration property would just hold arbitrary json. Each engine implementation would be responsible for "parsing" it.
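One way an engine implementation might "parse" such an arbitrary JSON property is to overlay it onto its own defaults and reject unknown keys. A minimal sketch, assuming illustrative default values (the function name and defaults are hypothetical):

```python
import json

# illustrative engine defaults; real defaults live in the engine's settings
DEFAULTS = {"model_type": "huggingface", "max_steps": 20000}

def parse_engine_config(config_json: str) -> dict:
    """Overlay a JSON config blob onto the engine's defaults,
    rejecting keys the engine does not recognize."""
    overrides = json.loads(config_json) if config_json else {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}
```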
Damien, your idea of putting the configuration in a JSON blob is growing on me. It avoids the need for a separate file and works really well when there are only a few parameters. Moreover, nothing prevents a configuration file type from being added later. It also makes the configuration readable from a single query of the build.
At this point, I don't think we will need to support a lot of parameters.
This is what we can implement now:
Also, regarding checking whether an engine has been built (from #97, from @ddaspit):
When certain translation engine endpoints are called, Serval will check if the engine was built and fail if it hasn't. Serval should not be performing this check. It is the responsibility of the engine to determine if it is in an invalid state. There are valid cases where an engine might be able to respond to these requests even when it hasn't been built. These checks should be performed by the appropriate engine.
The beginnings of a Serval-/Machine-side solution: https://github.com/sillsdev/serval/compare/main...config_passing_%2345 and https://github.com/sillsdev/machine/compare/master...config_passing_serval%2345
Great start! We will need SMT options, primarily for choosing the SMT model to use.
It was decided that machine.py will not need to receive extra parameters to do inference from NLLB. Instead, it will check if there is no training data provided, and in that case just do inferencing from NLLB.
Also, now that we have the `build_options` parameter, we should move the `model_type` and `max_steps` parameters inside the JSON for `build_options`.
@ddaspit Since we're moving the `model_type` and `max_steps` parameters to the `build_options` parameter, should we make `build_options` a required argument, or should it be optional with defaults for `model_type` and `max_steps`? Currently, `max_steps` has a default of 20000 steps, and there is no default for `model_type`.
You're talking about machine.py? In Machine, there are defaults of "huggingface"/20,000 here.
Yes, to clarify: machine.py is the one without a default setting for `model_type` in machine/jobs/settings.yaml. Since Machine does have a default, it sounds like I should add that default to machine.py and keep the `build_options` argument as an optional argument.
Yes, I would make it optional. We could merge the `build_options` into the job config. This would allow us to override any setting.
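Merging `build_options` into the job config could be a recursive dict overlay in which the build options win, so any job setting can be overridden per build. A sketch under that assumption (not the actual Machine.py code):

```python
def merge_configs(job_config: dict, build_options: dict) -> dict:
    """Recursively overlay build_options onto job_config, so a build
    can override any setting while leaving the rest untouched."""
    merged = dict(job_config)
    for key, value in build_options.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)  # merge nested sections
        else:
            merged[key] = value  # build option wins
    return merged
```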
@Enkidu93 Update to the plan: Damien would like us to only move `max_steps` into `build_options` and keep `model_type` outside of `build_options`. This way, different model types can have different sets of build options.
Done. See latest commit.
@Enkidu93 I finally got the E2E tests running, but the tests are failing because machine.py expects the JSON for `build_options` to be in string format, not pure JSON like Machine is sending. We could either have Machine convert the JSON to a string before passing arguments to machine.py, or machine.py could handle both string and non-string versions of the JSON. @ddaspit @johnml1135 Do you have a preference from an architectural standpoint?
I don't really have an opinion either way. I would just go ahead and update Machine.py to support both string and object for `build_options`.
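Supporting both forms could be a small normalization step on the machine.py side. A minimal sketch, assuming a hypothetical helper name:

```python
import json

def normalize_build_options(build_options) -> dict:
    """Accept build_options as either a JSON string or an already-parsed
    object, returning a dict in both cases."""
    if build_options is None:
        return {}
    if isinstance(build_options, str):
        return json.loads(build_options)
    if isinstance(build_options, dict):
        return build_options
    raise TypeError("build_options must be a JSON string or a JSON object")
```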
Okay, I'll do that. It looks like the PR has already been merged; I should have marked myself as blocking on it, sorry about that. I'll make another PR on the same branch once this is implemented.
In a meeting today, Damien, John, and I decided it would be better for Machine to pass the `build_options` as a string, so I added a commit for that in Machine's config_passing branch.
OK, great, thank you. What (if anything) can I do to help here? Or is this about wrapped up? I still need to make an edit so the build options are saved to MongoDB as JSON, but other than that?
Nothing else to do besides creating pull requests in Machine and Serval as soon as you're ready.
I just pushed to config_passing. Now things are stored appropriately in the db.
Without building, just direct inference. How do we choose between these different models, etc.? How do we configure them? A new engine type? A new "NmtInference" type?