googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0

aiplatform: ImportDataRequest missing ImportSchemaUri for tabular data #7973

Open galeone opened 1 year ago

galeone commented 1 year ago

I have created an empty dataset (using the Go SDK, but maybe the language doesn't matter), and the next step is to associate the data source with the dataset. When creating the empty dataset, I specified the "metadata schema URI" as gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml. The empty dataset was created with no problem, and it appeared on the Vertex AI dashboard.

The data is already uploaded to a bucket on Google Cloud Storage. So far so good. However, the "import data request" requires the attribute "Import Schema URI" to have a valid value.

Unfortunately, none of the available schemas are for tabular data, AFAIK. In fact, when listing the contents of the bucket gs://google-cloud-aiplatform/schema/dataset/ioformat, there's nothing about tabular data.

Am I doing something wrong or is this schema for the IO of tabular data missing?

e.g.

import (
    vai "cloud.google.com/go/aiplatform/apiv1beta1"
    vaipb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
)

// lots of other code

// Create the dataset
// ref: https://pkg.go.dev/cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb#CreateDatasetRequest
req := &vaipb.CreateDatasetRequest{
    // Required. The resource name of the Location to create the Dataset in.
    // Format: `projects/{project}/locations/{location}`
    Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
    Dataset: &vaipb.Dataset{
        DisplayName:       userDataset,
        Description:       fmt.Sprintf("User %d data", user.ID),
        MetadataSchemaUri: "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
    },
}
var createDatasetOp *vai.CreateDatasetOperation
if createDatasetOp, err = datasetClient.CreateDataset(ctx, req); err != nil {
    return err
}
if dataset, err = createDatasetOp.Wait(ctx); err != nil {
    return err
}
// So far OK!

importDataReq := &vaipb.ImportDataRequest{
    Name: datasetFullPath,
    ImportConfigs: []*vaipb.ImportDataConfig{
        {
            Source: &vaipb.ImportDataConfig_GcsSource{GcsSource: &vaipb.GcsSource{Uris: []string{fmt.Sprintf("gs://%s/%s", obj.BucketName(), obj.ObjectName())}}},
            // BELOW IS THE PROBLEM: WHAT'S THE CORRECT SCHEMA?
            ImportSchemaUri: "l",
        }},
}
var importDataOp *vai.ImportDataOperation
if importDataOp, err = datasetClient.ImportData(ctx, importDataReq); err != nil {
    return err
}
// Error!

Expected behavior

I should be able to import the tabular data into the dataset.

Actual behavior

If I leave the ImportSchemaUri attribute empty, I receive this message

Required field is not set.\t\nerror details: name = BadRequest field = name import_configs[0].import_schema_uri desc = Invalid Dataset resource name. Required field is not set.

If I put a wrong schema (like one of the other ioformat schemas that's not related to tabular data), I receive the message

rpc error: code = InvalidArgument desc = List of found errors:\t1.Field: name; Message: Invalid Dataset resource name.\t\nerror details: name = BadRequest field = name desc = Invalid Dataset resource name.

Anyway, I really do think the problem is the missing schema for tabular data. Is this correct? Is there a workaround for this?

Thanks!

galeone commented 1 year ago

Update: I understood that for tabular data the schema is the CSV (or BigQuery) structure itself, and that's fine. So there's no need to execute the ImportDataRequest.

However, the association between the storage and the dataset should be performed.

Reading the python source code I found this line:

https://github.com/googleapis/python-aiplatform/blob/6f3b34b39824717e7a995ca1f279230b41491f15/google/cloud/aiplatform/datasets/_datasources.py#L89

        if gcs_source:
            dataset_metadata = {"inputConfig": {"gcsSource": {"uri": gcs_source}}}
        elif bq_source:
            dataset_metadata = {"inputConfig": {"bigquerySource": {"uri": bq_source}}}

In practice, it's the equivalent of setting the *vaipb.Dataset.Metadata field to the equivalent Go struct.

I tried several ways, with no luck.

First attempt: passing the struct value:

metadata, err := structpb.NewStruct(map[string]interface{}{
        "inputConfig": map[string]interface{}{
                "gcsSource": map[string]interface{}{
                        "uri": gcsSource,
                },
        },
})

req := &vaipb.CreateDatasetRequest{
        // Required. The resource name of the Location to create the Dataset in.
        // Format: `projects/{project}/locations/{location}`
        Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
        Dataset: &vaipb.Dataset{
                DisplayName: userDataset,
                Description: fmt.Sprintf("User %d data", user.ID),
                // No metadata schema because it's a tabular dataset, and "tabular dataset does not support data import"
                // ref: https://github.com/googleapis/python-aiplatform/blob/6f3b34b39824717e7a995ca1f279230b41491f15/google/cloud/aiplatform/datasets/_datasources.py#LL223C30-L223C75
                MetadataSchemaUri: "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
                // But we need to pass the metadata as a structpb.Value
                // https://github.com/googleapis/python-aiplatform/blob/6f3b34b39824717e7a995ca1f279230b41491f15/google/cloud/aiplatform/datasets/_datasources.py#L48
                Metadata: structpb.NewStructValue(metadata),
        },
}

But the server returns:

rpc error: code = InvalidArgument desc = Invalid structured dataset value: [struct_value {
  fields {
    key: "inputConfig"
    value {
      struct_value {
        fields {
          key: "gcsSource"
          value {
            struct_value {
              fields {
                key: "uri"
                value {
                  string_value: "gs://train-and-deploy-experiment-user-data/1-2023-05-20_2023-01-13.csv"
                }
              }
            }
          }
        }
      }
    }
  }
}

Second attempt: passing a string containing the JSON:

// same request with the Metadata field set in this way
Metadata: structpb.NewStringValue(fmt.Sprintf(`{"inputConfig": {"gcsSource": {"uri": "%s"}}}`, gcsSource)),

In this case, the creation works, but in the Vertex AI page of the dataset, there's no association between the file on the bucket and the dataset.

I can do this process manually from the web page, and the association works. However, I need to do this from Go code.

So, yes, this is definitely a google-cloud-go issue now.

noahdietz commented 1 year ago

Routed to AI Platform product specialist for triage. Thanks @dizcology

dizcology commented 1 year ago

@galeone Thanks for reporting the issue.

Tabular datasets do not seem to support data import, and the association must be made at dataset creation time. The Python example (https://github.com/googleapis/python-aiplatform/blob/main/samples/snippets/dataset_service/create_dataset_tabular_gcs_sample.py#L34) specifies the gcs_uri in the dataset's metadata field.

UPDATE: Looking at your later comments, I see that you have already done that, but the issue is "in the Vertex AI page of the dataset, there's no association between the file on the bucket and the dataset." For good measure I just ran the Python sample with a public dataset gcs_uri='gs://cloud-samples-data/ai-platform/iris/iris_data.csv', and was able to see the gcs_uri in the console page for the newly created dataset (screenshot attached).

Could you share some more information about the missing association? What do you see when you navigate to the dataset in the console UI?

galeone commented 1 year ago

Hi @dizcology , thank you for the quick feedback.

After digging inside the Python codebase, I discovered the problem: I was using the wrong case for the Metadata :sweat_smile:

So, the correct way is:

err = metadata.UnmarshalJSON([]byte(fmt.Sprintf(`{"input_config": {"gcs_source": {"uri": ["%s"]}}}`, csvURI)))

Using this casing (input_config instead of inputConfig, and likewise gcs_source, ...), the association is created and visible in the panel.
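
For reference, here is a minimal sketch of the corrected dataset creation put together in one place (assuming csvURI, userDataset, and the other variables and imports from the snippets above):

// Sketch: create the tabular dataset with the CSV associated via Metadata,
// using the snake_case keys shown above.
var metadata structpb.Struct
if err := metadata.UnmarshalJSON([]byte(fmt.Sprintf(
        `{"input_config": {"gcs_source": {"uri": ["%s"]}}}`, csvURI))); err != nil {
        return err
}

req := &vaipb.CreateDatasetRequest{
        Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
        Dataset: &vaipb.Dataset{
                DisplayName:       userDataset,
                Description:       fmt.Sprintf("User %d data", user.ID),
                MetadataSchemaUri: "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
                Metadata:          structpb.NewStructValue(&metadata),
        },
}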

However, a new problem is present: the next step is to Export the data to a training pipeline (right?).

This is what I do:

        // 4. Export the dataset to a training pipeline
        // ref: https://pkg.go.dev/cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb#ExportDataRequest

        gcsDestination := fmt.Sprintf("gs://%s/%d/", bucketName, user.ID)
        exportDataReq := &vaipb.ExportDataRequest{
            // ref: https://pkg.go.dev/cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb#ExportDataRequest
            // Required. The name of the Dataset resource.
            // Format:
            // `projects/{project}/locations/{location}/datasets/{dataset}`
            // NOTE: the last parameter is the dataset ID and not the dataset display name!
            Name: fmt.Sprintf("projects/%s/locations/%s/datasets/%s", _vaiProjectID, _vaiLocation, datasetId),
            ExportConfig: &vaipb.ExportDataConfig{
                Destination: &vaipb.ExportDataConfig_GcsDestination{
                    GcsDestination: &vaipb.GcsDestination{
                        OutputUriPrefix: gcsDestination,
                    },
                },
                Split: &vaipb.ExportDataConfig_FractionSplit{
                    FractionSplit: &vaipb.ExportFractionSplit{
                        TrainingFraction:   0.8,
                        ValidationFraction: 0.1,
                        TestFraction:       0.1,
                    },
                },
            },
        }

        var op *vai.ExportDataOperation
        if op, err = datasetClient.ExportData(ctx, exportDataReq); err != nil {
            return err
        }
        if _, err = op.Wait(ctx); err != nil {
            return err
        } else {
            fmt.Println("Export data operation finished")
        }

Unfortunately, op.Wait returns an error that's too generic to be helpful:

rpc error: code = Internal desc = INTERNAL

Do you have any guesses on how to move further? What am I doing wrong?

Thank you again for the support

noahdietz commented 1 year ago

returns an error that's too generic to be helpful:

Have you tried inspecting the error details? https://pkg.go.dev/cloud.google.com/go#hdr-Inspecting_errors

They may or may not be present in the google.rpc.Status populated in the Operation.error field that is returned to you from Wait.
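
For reference, the inspection that page describes looks roughly like this (a sketch using the gax-go apierror package; err is the error returned by Wait, and inspectWaitErr is a hypothetical helper name):

import (
    "errors"
    "log"

    "github.com/googleapis/gax-go/v2/apierror"
)

// inspectWaitErr logs whatever structured detail the service attached, if any.
func inspectWaitErr(err error) {
    var ae *apierror.APIError
    if errors.As(err, &ae) {
        log.Println(ae.Reason())  // machine-readable reason, when the service sets one
        log.Println(ae.Details()) // any google.rpc error details attached to the status
    }
}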

galeone commented 1 year ago

Hi @noahdietz, thanks for your feedback.

Unfortunately, even when I use the package you suggested in this way

        var op *vai.ExportDataOperation
        if op, err = datasetClient.ExportData(ctx, exportDataReq); err != nil {
            return err
        }
        if _, err = op.Wait(ctx); err != nil {
            if s, ok := status.FromError(err); ok {
                log.Println(s.Message())
                for _, d := range s.Proto().Details {
                    log.Println(d)
                }
            }
            return err
        } else {
            fmt.Println("Export data operation finished")
        }

The only output I get from log.Println(s.Message()) is

2023/05/24 19:56:29 INTERNAL

and s.Proto().Details is empty, so no additional information is available.

I'm completely stuck now.

Edit: the Go client is weird.

If I remove the Split from the ExportConfig (thus, I suppose, using the default splits), the error changes and becomes visible:

2023/05/24 20:02:33 TABLE type Dataset (1064343834149, 2704833788701048832) does not support data export.

However, this sounds strange once again, since from the Vertex AI panel (the web page) I'm able to train an AutoML model using this dataset, and the training requires the data to be exported, doesn't it?

dizcology commented 1 year ago

@galeone : the next step for training is not exporting the data (which refers to copying the dataset that is already in Vertex AI's DatasetService to another location, such as Cloud Storage).

Once you have a dataset in the DatasetService, you can train a model in the UI or API. This documentation page contains some information: https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/train-model

galeone commented 1 year ago

Hi @dizcology, thank you for your support. Step by step I'm moving forward, but there's something blocking me at this stage as well.

In short: I'm creating a TrainingPipeline using Go, in this way:


        // Create the Training Task Inputs
        // Info gathered from the REST API: https://cloud.google.com/vertex-ai/docs/training/automl-api?hl=it#regression
        var trainingTaskInput structpb.Struct
        // reference: https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1/schema/trainingjob.definition#automltablesinputs

        // Create the transformations for all the columns (required)
        var transformations string
        tot := len(csvHeaders(allUserData)) - 1
        for i, header := range csvHeaders(allUserData) {
            if header == targetColumn {
                // required because with auto the pipeline fails with error message:
                // "The values in target column SleepEfficiency have to be numeric for regression model."
                transformations += fmt.Sprintf(`{"numeric": {"column_name": "%s"}}`, header)
            } else {
                transformations += fmt.Sprintf(`{"auto": {"column_name": "%s"}}`, header)
            }
            if i < tot {
                transformations += ","
            }
        }

        err = trainingTaskInput.UnmarshalJSON([]byte(
            fmt.Sprintf(
                `{
                    "targetColumn": "%s",
                    "predictionType": "regression",
                    "trainBudgetMilliNodeHours": "1000",
                    "optimizationObjective": "minimize-rmse",
                    "transformations": [%s]
                }`, targetColumn, transformations)))
        if err != nil {
            return err
        }
        // use https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1/schema/trainingjob.definition#google.cloud.aiplatform.v1.schema.trainingjob.definition.AutoMlTablesInputs.Transformation

        if trainingPipeline, err = pipelineClient.CreateTrainingPipeline(ctx, &vaipb.CreateTrainingPipelineRequest{
            // Required. The resource name of the Location to create the TrainingPipeline
            // in. Format: `projects/{project}/locations/{location}`
            Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
            TrainingPipeline: &vaipb.TrainingPipeline{
                DisplayName:            modelDisplayName,
                TrainingTaskDefinition: "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tables_1.0.0.yaml",
                InputDataConfig: &vaipb.InputDataConfig{
                    DatasetId: datasetId,
                },
                TrainingTaskInputs: structpb.NewStructValue(&trainingTaskInput),
            },
        }); err != nil {
            if s, ok := status.FromError(err); ok {
                log.Println(s.Message())
                for _, d := range s.Proto().Details {
                    log.Println(d)
                }
            }
            return err
        }

As you can see, I set the transformation for every column to auto, except for the targetColumn, which is set to numeric. I did this because the job failed with the error message

The values in target column SleepEfficiency have to be numeric for regression model.

(SleepEfficiency is the target column).

However, even after adding this transformation to the target column, I had no luck: I still receive the very same error.

The CSV data is correct; here's a row of the CSV (and I can confirm that all the rows in the CSV are consistent: same data types, no missing values).

ID,Date,NumberOfAutoDetectedActivities,NumberOfTrackerActivities,ActiveDurationSum,ActiveZoneMinutesSum,MinutesInCardioZoneSum,MinutesInFatBurnZoneSum,MinutesInPeakZoneSum,MinutesInOutOfZoneSum,ActivitiesNameConcatenation,CaloriesSum,DistanceSum,DurationSum,ElevationGainSum,StepsSum,AveragePace,AverageSpeed,AverageHeartRate,ActivityCalories,BMI,BodyFat,BodyWeight,CaloriesBMR,Calories,Distance,Floors,MinutesFairlyActive,MinutesLightlyActive,MinutesSedentary,MinutesVeryActive,Steps,RestingHeartRate,OutOfRangeMinutes,OutOfRangeCalories,OutOfRangeMaxBPM,OutOfRangeMinBPM,FatBurnMinutes,FatBurnCalories,FatBurnMaxBPM,FatBurnMinBPM,CardioMinutes,CardioCalories,CardioMaxBPM,CardioMinBPM,PeakMinutes,PeakCalories,PeakMaxBPM,PeakMinBPM,Elevation,SkinTemperature,CoreTemperature,AvgOxygenSaturation,MaxOxygenSaturation,MinOxygenSaturation,Vo2MaxLowerBound,Vo2MaxUpperBound,DailyRmssd,DeepRmssd,SleepDuration,SleepEfficiency,MinutesAfterWakeup,MinutesAsleep,MinutesAwake,MinutesToFallAsleep,TimeInBed,LightSleepMinutes,LightSleepCount,DeepSleepMinutes,DeepSleepCount,RemSleepMinutes,RemSleepCount,WakeSleepMinutes,WakeSleepCount
0,2023-06-03T00:00:00Z,0,1,3602000,0,14,43,0,0,Weights,569,0.00,3604000,0,1015,0.00,0.00,127.00,1889.00,22.49,20.00,74.50,1739.00,3283.00,6.09,0.00,18.00,301.00,577.00,59.00,8406.00,53,1379,2711.17,107,30,46,411.46,135,107,15,160.62,169,135,0,0.00,220,169,0.00,0.10,,93.70,95.90,90.80,52.00,56.00,46.88,36.79,29100000,43,0,417,68,0,485,244,47,102,4,71,5,68,47

The error is clear, but the solution is not: the SleepEfficiency column is numeric and the numeric transformation is applied, yet the training still fails.
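
As an aside, the transformations array can also be built with encoding/json instead of string concatenation, which rules out quoting or comma mistakes as a cause. A sketch, assuming the csvHeaders helper and targetColumn from the snippet above (buildTransformations is a hypothetical helper name):

import "encoding/json"

// buildTransformations marshals the per-column transformation specs in one go.
func buildTransformations(headers []string, targetColumn string) (string, error) {
    type colSpec map[string]map[string]string
    cols := make([]colSpec, 0, len(headers))
    for _, header := range headers {
        kind := "auto"
        if header == targetColumn {
            kind = "numeric" // the target column must be numeric for a regression model
        }
        cols = append(cols, colSpec{kind: {"column_name": header}})
    }
    b, err := json.Marshal(cols)
    return string(b), err
}

The returned string can be spliced into the trainingTaskInput JSON exactly where the hand-built transformations string is used above.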

If you could be so kind as to help me with this as well, it would be great.

Thank you very much

codyoss commented 1 year ago

@dizcology nudge