create new environment to support hyperdrive

informatics-lab / precip_rediagnosis

Project to use ML to re-diagnose precipitation fields from ensemble model fields

create new environment to support hyperdrive #71

Closed stevehadd closed 2 years ago

stevehadd commented 2 years ago

The current notebooks have some issues with reloading a trained model in the notebook. There seem to be issues with setting up the correct environment with everything installed and then running it in a notebook. We need to investigate the correct steps to do this reliably.

stevehadd commented 2 years ago

Use this notebook to test the hyperdrive environment: https://github.com/informatics-lab/precip_rediagnosis/blob/fraction_pipeline/fractions_model_pipeline/prd_mlops_azml_cluster_hyperdrive_demo_fractions.ipynb

stevehadd commented 2 years ago

I've run the above notebook, with the notebook itself using a conda environment defined by requirements_model_dev_azml.yml and the hyperdrive runs on the cluster using the prd_ml_cluster environment (which is defined by prd_ml_cluster.yml).

So I think our existing requirements files are working OK with HyperDrive. @hannahbrown7 I'd be interested to know whether this combination works OK for you.

hannahbrown7 commented 2 years ago

I have now tested the hyperdrive demo with the environments suggested above by @stevehadd. This works fine for running a hyperdrive experiment on a cluster, so long as we do not want to reload the model directly from the run using the following code.

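import pathlib
import tempfile

import tensorflow

# prd_run and prd_model_name are assumed to be defined earlier in the notebook.
with tempfile.TemporaryDirectory() as td1:
    td_path = pathlib.Path(td1)
    print(td_path)
    # Download the saved model files from the run into the temporary directory.
    prd_run.download_files(prefix=prd_model_name, output_directory=td1)
    model_path = td_path / prd_model_name
    print(model_path)
    print(list(model_path.iterdir()))
    # Load the model while the temporary directory still exists.
    trained_model = tensorflow.keras.models.load_model(model_path)
trained_model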

This approach to reloading the model requires tensorflow >= 2.7. Even when using a custom environment and explicitly defining the required TF version, I am still getting version 2.2, I assume due to a clash with azureml.train.

I don't think this requires any more follow-up; it is just something to be aware of. The alternative of using the hyperdrive experiment results to inform hyperparameter selection is still useful.
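For the version mismatch described above, a quick check run both in the notebook kernel and at the top of the training script can show which TensorFlow version each environment actually resolves. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `installed_version`) are hypothetical and not part of the project, and the 2.7 threshold is taken from the comment above.

```python
import importlib.metadata


def parse_version(version):
    """Return the leading integer parts of a version string, e.g. '2.7.0rc1' -> (2, 7, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '2.7' compares equal to '2.7.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("tensorflow")
    if found is None:
        print("tensorflow is not installed in this environment")
    else:
        print("tensorflow", found, "meets >= 2.7:", meets_minimum(found, "2.7"))
```

Logging this from the script that runs on the cluster, as well as from the notebook, would make it clearer whether the pinned version is being overridden on one side only.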

hannahbrown7 commented 2 years ago

One to flag with Microsoft perhaps

stevehadd commented 2 years ago

Thanks for that update @hannahbrown7. I don't think the model loading stuff was in the notebook when I was testing the environment, so I didn't test that functionality.

My main question then is what is there that you want to do that you cannot do? Are you able to get the model from the hyperdrive run some other way?

I definitely think loading the model (or indeed any of the models) trained through hyperdrive should be doable (indeed the idea is that they should be portable and we should be able to load them in a different environment for inference, e.g. spice or a local MacBook), so I'll investigate further how to get the correct version of tensorflow.

hannahbrown7 commented 2 years ago

No it wouldn't have been, I had removed it when it was reviewed previously because it was not working.

"My main question then is what is there that you want to do that you cannot do?"

Well, it would be useful to be able to get the model from the hyperdrive run somehow. Currently, I run hyperdrive to get the best hyperparameters (as shown in the notebook) and then train a model with those hyperparameters.
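The workaround described above (take the best hyperparameters from the hyperdrive run, then retrain) could be sketched roughly as follows. In the Azure ML SDK v1, the best child run is typically obtained with `get_best_run_by_primary_metric()` on the hyperdrive run, and its command-line arguments with `get_details()['runDefinition']['arguments']`; those calls are shown only in comments here because they need a live workspace. The helper `arguments_to_dict` and the argument names in the example are hypothetical, not taken from the project.

```python
def arguments_to_dict(arguments):
    """Convert a flat argument list such as ['--lr', '0.01', '--verbose']
    into a dict such as {'lr': '0.01', 'verbose': True}.

    Values are kept exactly as given (usually strings); a flag with no
    following value is stored as True.
    """
    params = {}
    i = 0
    while i < len(arguments):
        token = str(arguments[i])
        if not token.startswith("--"):
            # Skip anything that is not a '--name' token (e.g. a script path).
            i += 1
            continue
        key = token[2:]
        has_value = (
            i + 1 < len(arguments)
            and not str(arguments[i + 1]).startswith("--")
        )
        if has_value:
            params[key] = arguments[i + 1]
            i += 2
        else:
            params[key] = True
            i += 1
    return params


# With a live workspace, the argument list would come from something like:
#   best_run = hyperdrive_run.get_best_run_by_primary_metric()
#   arguments = best_run.get_details()["runDefinition"]["arguments"]
# Here a hypothetical list stands in for it.
example_arguments = ["--learning-rate", "0.001", "--batch-size", "64", "--verbose"]
best_params = arguments_to_dict(example_arguments)
print(best_params)
```

The resulting dictionary can then be passed to whatever training function the notebook already uses, after converting the string values to the expected types.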

stevehadd commented 2 years ago

OK, I have now finished trying this with the following notebook on my dev branch (this uses the mean prediction model rather than the fractional one, which should be the same from an environment point of view, but I will check it shortly just in case): https://github.com/informatics-lab/precip_rediagnosis/blob/prd85_azml_tensorboard/model_pipeline/prd_mlops_azml_cluster_hyperdrive_demo.ipynb

I was using the requirements_model_dev_azml.yml env for the notebook and the prd_ml_cluster environment (defined by prd_ml_cluster.yml) for running the hyperdrive stuff on the cluster. Both have tensorflow 2.8 installed, and I was able to load a model from the best hyperdrive run with no problem. You might have something weird going on with your environment @hannahbrown7; we'll take a look in the meeting shortly.
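Since loading works when both environments carry the same TensorFlow version, one way to rule out a mismatch is to compare the pins in the two environment files directly. The sketch below is an assumption-laden illustration: the file contents are made up (the real files are not reproduced in this thread), and `find_pin` is a hypothetical helper that only understands simple `- package=version` style lines rather than full conda YAML.

```python
import re


def find_pin(env_text, package):
    """Return the pinned version of a package in conda-style YAML text,
    or None if the package is absent or listed without a version."""
    pattern = re.compile(
        r"^\s*-\s*" + re.escape(package) + r"\s*([=<>!~]{1,2})\s*([0-9][\w.*]*)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(env_text)
    return match.group(2) if match else None


# Hypothetical file contents, standing in for the notebook and cluster env files.
NOTEBOOK_ENV = """\
name: example_notebook_env
dependencies:
  - python=3.8
  - tensorflow=2.8
  - pip:
    - azureml-core
"""

CLUSTER_ENV = """\
name: example_cluster_env
dependencies:
  - python=3.8
  - pip:
    - tensorflow==2.8
"""

notebook_pin = find_pin(NOTEBOOK_ENV, "tensorflow")
cluster_pin = find_pin(CLUSTER_ENV, "tensorflow")
print("notebook:", notebook_pin, "cluster:", cluster_pin, "match:", notebook_pin == cluster_pin)
```

In practice the two strings would be read from the actual files, for example with `pathlib.Path(...).read_text()`. A matching pin does not guarantee the same resolved version at run time, so it complements rather than replaces checking the version inside each environment.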