microsoft / solution-accelerator-many-models

MIT License

custom script - parallelrunstep not working #118

Open michalmar opened 3 years ago

michalmar commented 3 years ago

When I try to run the example Custom_Script/02_CustomScript_Training_Pipeline.ipynb, I cannot create a ParallelRunConfig:

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script='train.py',
    mini_batch_size="1",
    run_invocation_timeout=timeout,
    error_threshold=10,
    output_action="append_row",
    environment=train_env,
    process_count_per_node=processes_per_node,
    compute_target=compute,
    node_count=node_count)

It gives this error:

in /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/pipeline/steps/parallel_run_config.py
...
TypeError: __init__() got an unexpected keyword argument 'allowed_failed_count'

I have updated to the latest SDK (pipeline packages):

azureml-pipeline-core==1.22.0
azureml-pipeline-steps==1.22.0

When I downgrade to 1.20.0, it works:

azureml-pipeline-core==1.20.0
azureml-pipeline-steps==1.20.0

So the fix is:

!pip install --upgrade azureml-pipeline-steps==1.20.0

dkmiller commented 3 years ago

The official docs for ParallelRunConfig still show that keyword argument: azureml.pipeline.steps.ParallelRunConfig.

I wonder if you're using the stale one: azureml.contrib.pipeline.steps.parallel_run_config.ParallelRunConfig?

michalmar commented 3 years ago

@dkmiller how can I check whether I am using the stale one?

dkmiller commented 3 years ago

Look in your script to see from where you are importing the ParallelRunConfig.

Also, I suggest making sure you have pulled the latest version of this repo.
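
If reading the import line is not conclusive, the class can also be inspected at runtime from a notebook cell. The following is a minimal sketch, not part of the repo: `describe_class` is a hypothetical helper that reports which module and file a class was loaded from and whether its own `__init__` names a given keyword. It only looks at the class itself, so it will not show what a parent class accepts.

```python
import inspect


def describe_class(cls, keyword):
    """Report where `cls` was loaded from and whether its __init__ names `keyword`.

    Hypothetical helper for illustration; it inspects only the class's own
    __init__, not any parent class that __init__ may forward arguments to.
    """
    params = inspect.signature(cls.__init__).parameters
    accepts_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    try:
        source_file = inspect.getsourcefile(cls)
    except (TypeError, OSError):
        # Built-in classes and classes defined interactively have no source file.
        source_file = None
    return {
        "module": cls.__module__,
        "file": source_file,
        "names_keyword": keyword in params,
        "accepts_kwargs": accepts_kwargs,
    }


if __name__ == "__main__":
    try:
        from azureml.pipeline.steps import ParallelRunConfig
    except ImportError:
        print("azureml-pipeline-steps is not installed in this environment")
    else:
        report = describe_class(ParallelRunConfig, "allowed_failed_count")
        for key, value in report.items():
            print(f"{key}: {value}")
```

The "module" and "file" entries show whether the class comes from azureml.pipeline.steps or from the azureml.contrib package, and which site-packages directory it was loaded from.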

michalmar commented 3 years ago

I am using the official one, not the stale one: from azureml.pipeline.steps import ParallelRunConfig

I cloned the repo a couple of days ago, so I am not sure where the problem comes from.

dkmiller commented 3 years ago

I could not reproduce this. Try this "clean" Dockerfile:

FROM python:3.8

RUN pip install azureml-pipeline-steps==1.22.0 azureml-pipeline-core==1.22.0

RUN python -c "from azureml.pipeline.steps import ParallelRunConfig; cfg = ParallelRunConfig(allowed_failed_count=1,entry_script='hi.py',environment='foo',error_threshold=1,output_action='append_row',compute_target='cluster',node_count=1)"

Docker build fails with:

ValueError: Parameter environment must be an instance of azureml.core.Environment. The actual value is foo.

which means that there is no problem with the keyword allowed_failed_count. I'd suggest re-creating your Python environment.
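
Before re-creating the environment, it may help to list which azureml-* versions the notebook kernel actually sees, in case more than one SDK version is installed side by side (that is only a guess, not something confirmed in this thread). A minimal sketch, assuming Python 3.8+ for `importlib.metadata`; `installed_versions` and `group_by_version` are hypothetical helper names:

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_versions(prefix="azureml-"):
    """Map each installed distribution whose name starts with `prefix` to its version."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix):
            versions[name.lower()] = dist.version
    return versions


def group_by_version(versions):
    """Invert {package: version} into {version: [packages]} to expose mixed installs."""
    groups = {}
    for name, version in sorted(versions.items()):
        groups.setdefault(version, []).append(name)
    return groups


if __name__ == "__main__":
    # One line per version; more than one line means the packages are mixed.
    for version, names in sorted(group_by_version(installed_versions()).items()):
        print(version, ", ".join(names))
```

On the Python 3.6 kernel from the traceback, `pip list` run from the same kernel gives the same information.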

michalmar commented 3 years ago

@dkmiller I am running on AML CI. Should I create a new conda env?

dkmiller commented 3 years ago

Yes, I'd suggest creating a new Conda environment from scratch. Follow this article to expose that Conda environment as a Jupyter kernel: https://medium.com/@nrk25693/how-to-add-your-conda-environment-to-your-jupyter-notebook-in-just-4-steps-abeab8b8d084 .
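
Once the new kernel is selected in the notebook, a quick cell like the one below can confirm that it really runs from the new environment and not from the old azureml_py36 one. This is only a sketch; `interpreter_info` is a hypothetical helper name.

```python
import sys


def interpreter_info():
    """Return the interpreter path and environment prefix of the running kernel."""
    return {
        "executable": sys.executable,  # path to the Python binary in use
        "prefix": sys.prefix,          # root directory of the active environment
    }


if __name__ == "__main__":
    for key, value in interpreter_info().items():
        print(f"{key}: {value}")
```

Both paths should point inside the directory of the newly created Conda environment.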