galeone opened this issue 1 year ago
Update: I understood that for tabular data the schema is the CSV (or BigQuery) structure itself. And that's fine, so there's no need to execute the ImportDataRequest.
However, the association between the storage and the dataset should be performed.
Reading the Python source code, I found these lines:
if gcs_source:
    dataset_metadata = {"inputConfig": {"gcsSource": {"uri": gcs_source}}}
elif bq_source:
    dataset_metadata = {"inputConfig": {"bigquerySource": {"uri": bq_source}}}
In practice, it's the equivalent of setting the Metadata field of *vaipb.Dataset to the equivalent Go struct.
I tried in several ways, with 0 luck.
First attempt: the struct value:
metadata, err := structpb.NewStruct(map[string]interface{}{
    "inputConfig": map[string]interface{}{
        "gcsSource": map[string]interface{}{
            "uri": gcsSource,
        },
    },
})
req := &vaipb.CreateDatasetRequest{
    // Required. The resource name of the Location to create the Dataset in.
    // Format: `projects/{project}/locations/{location}`
    Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
    Dataset: &vaipb.Dataset{
        DisplayName: userDataset,
        Description: fmt.Sprintf("User %d data", user.ID),
        // No metadata schema because it's a tabular dataset, and "tabular dataset does not support data import"
        // ref: https://github.com/googleapis/python-aiplatform/blob/6f3b34b39824717e7a995ca1f279230b41491f15/google/cloud/aiplatform/datasets/_datasources.py#LL223C30-L223C75
        MetadataSchemaUri: "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
        // But we need to pass the metadata as a structpb.Value
        // https://github.com/googleapis/python-aiplatform/blob/6f3b34b39824717e7a995ca1f279230b41491f15/google/cloud/aiplatform/datasets/_datasources.py#L48
        Metadata: structpb.NewStructValue(metadata),
    },
}
But the server returns:
rpc error: code = InvalidArgument desc = Invalid structured dataset value: [struct_value {
fields {
key: "inputConfig"
value {
struct_value {
fields {
key: "gcsSource"
value {
struct_value {
fields {
key: "uri"
value {
string_value: "gs://train-and-deploy-experiment-user-data/1-2023-05-20_2023-01-13.csv"
}
}
}
}
}
}
}
}
}
Second attempt: passing a string containing the JSON:
// same request with the Metadata field set in this way
Metadata: structpb.NewStringValue(fmt.Sprintf(`{"inputConfig": {"gcsSource": {"uri": "%s"}}}`, gcsSource)),
In this case, the creation works, but in the Vertex AI page of the dataset, there's no association between the file on the bucket and the dataset.
I can do this process manually from the webpage and the association is possible. However, I need to do this from code in Go.
So, yes, this is definitely a google-cloud-go issue now
Routed to AI Platform product specialist for triage. Thanks @dizcology
@galeone Thanks for reporting the issue.
Tabular datasets do not seem to support data import, and the association must be made at dataset creation time. The Python example (https://github.com/googleapis/python-aiplatform/blob/main/samples/snippets/dataset_service/create_dataset_tabular_gcs_sample.py#L34) specifies the gcs_uri in the dataset's metadata field.
UPDATE: Looking at your later comments, I see that you have already done that, but the issue is "in the Vertex AI page of the dataset, there's no association between the file on the bucket and the dataset." For good measure I just ran the Python sample with a public dataset gcs_uri='gs://cloud-samples-data/ai-platform/iris/iris_data.csv', and was able to see the gcs_uri in the console page for the newly created dataset:
Could you share some more information about the missing association? What do you see when you navigate to the dataset in the console UI?
Hi @dizcology , thank you for the quick feedback.
After digging inside the Python codebase, I discovered the problem: I was using the wrong casing for the Metadata keys :sweat_smile:
So, the correct way is:
err = metadata.UnmarshalJSON([]byte(fmt.Sprintf(`{"input_config": {"gcs_source": {"uri": ["%s"]}}}`, csvURI)))
Using this casing (input_config instead of inputConfig, and the same for gcs_source, ...), the association is available and visible in the panel.
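For completeness, here's a minimal sketch of the full creation request with the snake_case metadata keys (assuming the same _vaiProjectID, _vaiLocation, userDataset, and csvURI variables as above):

// Build the dataset metadata with snake_case keys (input_config / gcs_source);
// note that "uri" takes a list of URIs.
var metadata structpb.Struct
if err := metadata.UnmarshalJSON([]byte(fmt.Sprintf(
    `{"input_config": {"gcs_source": {"uri": ["%s"]}}}`, csvURI))); err != nil {
    return err
}
req := &vaipb.CreateDatasetRequest{
    Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
    Dataset: &vaipb.Dataset{
        DisplayName:       userDataset,
        MetadataSchemaUri: "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
        Metadata:          structpb.NewStructValue(&metadata),
    },
}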
However, a new problem is present: the next step is to Export the data to a training pipeline (right?).
This is what I do:
// 4. Export the dataset to a training pipeline
// ref: https://pkg.go.dev/cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb#ExportDataRequest
gcsDestination := fmt.Sprintf("gs://%s/%d/", bucketName, user.ID)
exportDataReq := &vaipb.ExportDataRequest{
    // ref: https://pkg.go.dev/cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb#ExportDataRequest
    // Required. The name of the Dataset resource.
    // Format:
    // `projects/{project}/locations/{location}/datasets/{dataset}`
    // NOTE: the last parameter is the dataset ID and not the dataset display name!
    Name: fmt.Sprintf("projects/%s/locations/%s/datasets/%s", _vaiProjectID, _vaiLocation, datasetId),
    ExportConfig: &vaipb.ExportDataConfig{
        Destination: &vaipb.ExportDataConfig_GcsDestination{
            GcsDestination: &vaipb.GcsDestination{
                OutputUriPrefix: gcsDestination,
            },
        },
        Split: &vaipb.ExportDataConfig_FractionSplit{
            FractionSplit: &vaipb.ExportFractionSplit{
                TrainingFraction:   0.8,
                ValidationFraction: 0.1,
                TestFraction:       0.1,
            },
        },
    },
}
var op *vai.ExportDataOperation
if op, err = datasetClient.ExportData(ctx, exportDataReq); err != nil {
    return err
}
if _, err = op.Wait(ctx); err != nil {
    return err
} else {
    fmt.Println("Export data operation finished")
}
Unfortunately, op.Wait returns an error that's too generic to be helpful:
rpc error: code = Internal desc = INTERNAL
Do you have any guesses on how to move further? What am I doing wrong?
Thank you again for the support
returns an error that's too generic to be helpful:
Have you tried inspecting the error details? https://pkg.go.dev/cloud.google.com/go#hdr-Inspecting_errors
They may or may not be present in the google.rpc.Status populated in the Operation.error field that is returned to you from Wait.
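For example, a minimal sketch using the gax apierror helper (inspectErr is just a hypothetical helper name; the details may still turn out to be empty for an INTERNAL error):

import (
    "log"

    "github.com/googleapis/gax-go/v2/apierror"
)

// inspectErr logs whatever structured detail is attached to an API error.
func inspectErr(err error) {
    if apiErr, ok := apierror.FromError(err); ok {
        log.Printf("reason: %s", apiErr.Reason())
        log.Printf("details: %+v", apiErr.Details())
        if s := apiErr.GRPCStatus(); s != nil {
            log.Printf("grpc code: %s, message: %s", s.Code(), s.Message())
        }
    }
}

Calling inspectErr(err) right after op.Wait(ctx) fails would surface whatever the service attached, if anything.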
Hi @noahdietz, thanks for your feedback.
Unfortunately, even when I use the package you suggested in this way:
var op *vai.ExportDataOperation
if op, err = datasetClient.ExportData(ctx, exportDataReq); err != nil {
    return err
}
if _, err = op.Wait(ctx); err != nil {
    if s, ok := status.FromError(err); ok {
        log.Println(s.Message())
        for _, d := range s.Proto().Details {
            log.Println(d)
        }
    }
    return err
} else {
    fmt.Println("Export data operation finished")
}
the only output I get from log.Println(s.Message()) is
2023/05/24 19:56:29 INTERNAL
and s.Proto().Details is empty, so no additional information is available.
I'm completely stuck now.
Edit: the Go client is weird.
If I remove the Split from the ExportConfig (thus, I suppose, falling back to the default split), the error changes and becomes visible:
2023/05/24 20:02:33 TABLE type Dataset (1064343834149, 2704833788701048832) does not support data export.
However, this sounds strange once again, since from the Vertex AI panel (the web page) I'm able to train an AutoML model using this dataset, and the training requires the data to be exported, doesn't it?
@galeone : the next step for training is not exporting the data (which refers to copying the dataset that is already in Vertex AI's DatasetService to another location, such as Cloud Storage).
Once you have a dataset in the DatasetService, you can train a model in the UI or API. This documentation page contains some information: https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/train-model
Hi @dizcology, thank you for your support. Step by step I'm moving forward, but there's something blocking me at this stage too.
In short: I'm creating a TrainingPipeline using Go, in this way:
// Create the Training Task Inputs
// Info gathered from the REST API: https://cloud.google.com/vertex-ai/docs/training/automl-api?hl=it#regression
var trainingTaskInput structpb.Struct
// reference: https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1/schema/trainingjob.definition#automltablesinputs
// Create the transformations for all the columns (required)
var transformations string
tot := len(csvHeaders(allUserData)) - 1
for i, header := range csvHeaders(allUserData) {
    if header == targetColumn {
        // required because with auto the pipeline fails with error message:
        // "The values in target column SleepEfficiency have to be numeric for regression model."
        transformations += fmt.Sprintf(`{"numeric": {"column_name": "%s"}}`, header)
    } else {
        transformations += fmt.Sprintf(`{"auto": {"column_name": "%s"}}`, header)
    }
    if i < tot {
        transformations += ","
    }
}
err = trainingTaskInput.UnmarshalJSON([]byte(
    fmt.Sprintf(
        `{
            "targetColumn": "%s",
            "predictionType": "regression",
            "trainBudgetMilliNodeHours": "1000",
            "optimizationObjective": "minimize-rmse",
            "transformations": [%s]
        }`, targetColumn, transformations)))
if err != nil {
    return err
}
// use https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1/schema/trainingjob.definition#google.cloud.aiplatform.v1.schema.trainingjob.definition.AutoMlTablesInputs.Transformation
if trainingPipeline, err = pipelineClient.CreateTrainingPipeline(ctx, &vaipb.CreateTrainingPipelineRequest{
    // Required. The resource name of the Location to create the TrainingPipeline
    // in. Format: `projects/{project}/locations/{location}`
    Parent: fmt.Sprintf("projects/%s/locations/%s", _vaiProjectID, _vaiLocation),
    TrainingPipeline: &vaipb.TrainingPipeline{
        DisplayName:            modelDisplayName,
        TrainingTaskDefinition: "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tables_1.0.0.yaml",
        InputDataConfig: &vaipb.InputDataConfig{
            DatasetId: datasetId,
        },
        TrainingTaskInputs: structpb.NewStructValue(&trainingTaskInput),
    },
}); err != nil {
    if s, ok := status.FromError(err); ok {
        log.Println(s.Message())
        for _, d := range s.Proto().Details {
            log.Println(d)
        }
    }
    return err
}
As you can see, I set the transformation for all the columns to auto, except for the targetColumn, which is numeric.
I did this because the job failed with the error message
The values in target column SleepEfficiency have to be numeric for regression model.
(SleepEfficiency is the target column).
However, even after adding this transformation to the target column, I had no luck: I still receive the very same error.
The CSV data is correct; here's a row of the CSV (and I can confirm that all the data in the CSV is consistent: same data types, no missing values).
ID,Date,NumberOfAutoDetectedActivities,NumberOfTrackerActivities,ActiveDurationSum,ActiveZoneMinutesSum,MinutesInCardioZoneSum,MinutesInFatBurnZoneSum,MinutesInPeakZoneSum,MinutesInOutOfZoneSum,ActivitiesNameConcatenation,CaloriesSum,DistanceSum,DurationSum,ElevationGainSum,StepsSum,AveragePace,AverageSpeed,AverageHeartRate,ActivityCalories,BMI,BodyFat,BodyWeight,CaloriesBMR,Calories,Distance,Floors,MinutesFairlyActive,MinutesLightlyActive,MinutesSedentary,MinutesVeryActive,Steps,RestingHeartRate,OutOfRangeMinutes,OutOfRangeCalories,OutOfRangeMaxBPM,OutOfRangeMinBPM,FatBurnMinutes,FatBurnCalories,FatBurnMaxBPM,FatBurnMinBPM,CardioMinutes,CardioCalories,CardioMaxBPM,CardioMinBPM,PeakMinutes,PeakCalories,PeakMaxBPM,PeakMinBPM,Elevation,SkinTemperature,CoreTemperature,AvgOxygenSaturation,MaxOxygenSaturation,MinOxygenSaturation,Vo2MaxLowerBound,Vo2MaxUpperBound,DailyRmssd,DeepRmssd,SleepDuration,SleepEfficiency,MinutesAfterWakeup,MinutesAsleep,MinutesAwake,MinutesToFallAsleep,TimeInBed,LightSleepMinutes,LightSleepCount,DeepSleepMinutes,DeepSleepCount,RemSleepMinutes,RemSleepCount,WakeSleepMinutes,WakeSleepCount
0,2023-06-03T00:00:00Z,0,1,3602000,0,14,43,0,0,Weights,569,0.00,3604000,0,1015,0.00,0.00,127.00,1889.00,22.49,20.00,74.50,1739.00,3283.00,6.09,0.00,18.00,301.00,577.00,59.00,8406.00,53,1379,2711.17,107,30,46,411.46,135,107,15,160.62,169,135,0,0.00,220,169,0.00,0.10,,93.70,95.90,90.80,52.00,56.00,46.88,36.79,29100000,43,0,417,68,0,485,244,47,102,4,71,5,68,47
The error is clear, but the solution is not. The sleep efficiency column is numeric and the numeric transform is also applied. Anyway, the training fails.
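For reference, the same payload could also be built with encoding/json instead of string concatenation, which rules out quoting mistakes (a minimal sketch, assuming the same csvHeaders helper and targetColumn as above; the resulting JSON should be equivalent to the string built earlier):

// Sketch: build the AutoML Tables training task inputs via encoding/json.
// Requires "encoding/json" among the imports.
transforms := make([]map[string]interface{}, 0, len(csvHeaders(allUserData)))
for _, header := range csvHeaders(allUserData) {
    kind := "auto"
    if header == targetColumn {
        // The target column must be numeric for a regression model.
        kind = "numeric"
    }
    transforms = append(transforms, map[string]interface{}{
        kind: map[string]interface{}{"column_name": header},
    })
}
raw, err := json.Marshal(map[string]interface{}{
    "targetColumn":              targetColumn,
    "predictionType":            "regression",
    "trainBudgetMilliNodeHours": "1000",
    "optimizationObjective":     "minimize-rmse",
    "transformations":           transforms,
})
if err != nil {
    return err
}
var trainingTaskInput structpb.Struct
if err := trainingTaskInput.UnmarshalJSON(raw); err != nil {
    return err
}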
If you'd be so kind as to help me with this as well, it would be great.
Thank you very much
@dizcology nudge
I have created the empty dataset (using the Go SDK, but maybe the language doesn't matter), and the next step is to associate the data source with the dataset. When creating the empty dataset, I specified the "metadata schema URI" as gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml. The empty dataset has been created with no problem, and it appeared on the Vertex AI dashboard. The data is already uploaded to a bucket on Google Cloud Storage. So far so good. However, the "import data request" requires the attribute "Import Schema URI" to have a valid value.
Unfortunately, none of the available schemas are for tabular data, AFAIK. In fact, when listing the content of the bucket gs://google-cloud-aiplatform/schema/dataset/ioformat, there's nothing about tabular data. Am I doing something wrong, or is the schema for the IO of tabular data missing?
Expected behavior
I should be able to import the tabular data into the dataset.
Actual behavior
If I leave the ImportSchemaUri attribute empty, I receive this message. If I put a wrong schema (like one of the other ioformat schemas that's not related to tabular data), I receive the message
Anyway, I do really think the problem is the missing schema for the tabular data. Is this correct? Is there a workaround for this?
Thanks!