statmike / vertex-ai-mlops

Google Cloud Platform Vertex AI end-to-end workflows for machine learning operations
Apache License 2.0
450 stars 202 forks

02c: change of region causes problems. #54

Open hymanroth opened 9 months ago

hymanroth commented 9 months ago

Hi Mike, and thanks for this excellent series of tutorials.

I got as far as 02b when I started having resource availability issues in us-central1. After several days of not being able to open the notebook, I decided to start again in a new region (europe-west6). I began with 00-setup and got as far as 02b without any issues. However, several pipeline tasks in 02c failed because, if a region is not explicitly specified in a task, the location defaults to us-central1. Hence the executors were looking for artifacts in the wrong region.

To cut a long story short, I only managed to complete the pipeline successfully by explicitly specifying a location in each task:

    # dataset
    dataset = TabularDatasetCreateOp(
        location = REGION,
        project = project,
        ...
    )

    # training
    model = AutoMLTabularTrainingJobRunOp(
        location = REGION,
        project = project,
        ...
    )

    # Endpoint: Creation
    endpoint = EndpointCreateOp(
        location = REGION,
        project = project,
        ...
    )

At first I tried explicitly setting the location in the top-level pipeline definition, hoping that it would be inherited by the underlying tasks, but this didn't work. Perhaps there is another way of provoking this behavior.
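In the meantime, one way to avoid repeating `location=` on every call is to pre-bind it once with `functools.partial`. This is a plain-Python sketch: the component function below is a stand-in for the real `google_cloud_pipeline_components` operators (`TabularDatasetCreateOp`, etc.), and the project id is hypothetical.

```python
from functools import partial

# Stand-in for a real pipeline component such as TabularDatasetCreateOp;
# in the actual pipeline you would wrap the imported operator instead.
def tabular_dataset_create_op(project, location, display_name):
    return {"project": project, "location": location, "display_name": display_name}

PROJECT = "my-project"       # hypothetical project id
REGION = "europe-west6"

# Pre-bind project and location so every task receives them explicitly,
# instead of silently falling back to the us-central1 default.
def with_region(op):
    return partial(op, project=PROJECT, location=REGION)

dataset_op = with_region(tabular_dataset_create_op)
task = dataset_op(display_name="fraud-dataset")
# task["location"] == "europe-west6"
```

This keeps each task's location explicit without sprinkling `location = REGION` through every call site, though it is only a workaround until the components inherit the pipeline-level location.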

As an aside, the solution above involved running the pipeline several times, because I had to wait for each task to complete before I could verify that the next one was OK. This meant I trained the model (with identical data) three times, with a 2 hour wait each time. It was only later that I realized that Vertex Pipelines has a cache which can be used to skip repeated invocations of the same task. The reason the cache was not used is that you use a timestamp as part of the pipeline id. This acts as a "cache-buster" and bypasses the cache by default. I would suggest using a less volatile pipeline id (e.g. an explicit version number) and adding a note to explain how the timestamp can be used to force a complete recalculation of the pipeline if necessary.
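A tiny plain-Python illustration of the cache-buster effect (the naming scheme below is hypothetical; Vertex caching itself can also be toggled with the `enable_caching` flag on `PipelineJob`):

```python
from datetime import datetime

def pipeline_job_id(base: str, version: str, cache_bust: bool = False) -> str:
    """Build a pipeline job id.

    A stable version suffix leaves the id identical across runs, so Vertex
    can serve repeated task invocations from its cache; appending a
    timestamp makes every run unique and defeats the cache.
    """
    if cache_bust:
        # Timestamp suffix: a new id per run, so no cached task is reused.
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        return f"{base}-{version}-{stamp}"
    return f"{base}-{version}"

stable = pipeline_job_id("automl-tabular", "v1")
# stable is "automl-tabular-v1" on every run, keeping the cache usable.
busted = pipeline_job_id("automl-tabular", "v1", cache_bust=True)
# busted changes each run, e.g. "automl-tabular-v1-20240101120000".
```

Keeping the timestamp behind an explicit flag would let readers opt in to a full re-run while getting cache hits by default.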

Thanks again for the work you have done here, it's super useful!

statmike commented 9 months ago

Thank you @hymanroth. This series definitely needs an update, and I have plans for a full update path and an expansion of the AutoML notebooks. I will incorporate a fix for the GCP-provided components needing the location explicitly specified, since they do not inherit it. In the meantime, I did update the readme for the AutoML folder to include notes about AutoML resource considerations by location. Hopefully this update will happen before November - a few things are ahead of it in the TensorFlow series, BQML and Applied GenAI.