I've registered my data as a FileDataset per instructions mentioned in the 01_Data_Preparation notebook. However, when I call the get_many_models_train_steps method an exception is thrown on step input type of <class 'azureml.data.file_dataset.FileDataset'>.
There doesn't appear to be documentation stating that I should register the dataset as any of the classes listed in the ALLOWED_INPUT_TYPES such as azureml.pipeline.core.pipeline_output_dataset.PipelineOutputFileDataset in the exception.
Below is the code that I'm using in an AzureML notebook along with the exception that gets raised. Is there any intermediate processing that I'm missing where the dataset would be converted to one of the allowed input types?
Any direction is greatly appreciated. Thanks!
#keep azureml-core updated to the latest version
!pip install --upgrade azureml-core
#Install the azureml-contrib-automl-pipeline-steps package that is needed for many models
!pip install azureml-contrib-automl-pipeline-steps
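Only azureml-core is upgraded explicitly here, so the other azureml packages could be on different releases. A minimal sketch for listing what is actually installed, assuming (not confirmed) that a version mismatch is worth ruling out. `installed_versions` is a hypothetical helper, and the package list is my guess at the relevant distributions:

```python
from importlib import metadata

# Distribution names are assumptions; adjust to what the environment uses.
PACKAGES = [
    "azureml-core",
    "azureml-pipeline-core",
    "azureml-pipeline-steps",
    "azureml-train-automl",
    "azureml-contrib-automl-pipeline-steps",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

If the versions differ noticeably, that would be useful to include alongside the exception.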
#dependencies
import logging
import os
import random
import time
import json
from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from datetime import datetime
import azureml.core
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.contrib.automl.pipeline.steps import AutoMLPipelineBuilder
from azureml.automl.core.forecasting_parameters import ForecastingParameters
#initialize workspace and remote compute
ws = Workspace.create(name=workspace_name,
                      subscription_id=subscription_id,
                      resource_group=resource_group,
                      location=workspace_region,
                      exist_ok=True)
amlcompute_cluster_name = "cluster-{}".format(ws._workspace_id)[:10]
# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2', min_nodes=2, max_nodes=12)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)
#set up experiment
experiment_name = 'automl-manymodels-salesforecast'
experiment = Experiment(ws, experiment_name)
#sourcing data as FileDataset
ds = ws.get_default_datastore()
train_datastore_paths = [(ds, 'azureml/ForecastsInput/train/')]
test_datastore_paths = [(ds, 'azureml/ForecastsInput/test/')]
train_sales_data = Dataset.File.from_files(path=train_datastore_paths)
test_sales_data = Dataset.File.from_files(path=test_datastore_paths)
smpl_view = pd.read_csv(train_sales_data.download()[0])
smpl_view.head(5) #this works as expected
#configuration parameters for forecasting task
target_column_name = 'revenue'
time_column_name = 'month_date'
time_series_id_column_names = 'partner'
forecast_horizon = 3
freq='MS' #MonthBegin per pandas offset
ts_models = ['Naive', 'AutoArima', 'SeasonalAverage', 'SeasonalNaive', 'ExponentialSmoothing', 'Arimax', 'Average','Prophet']
automl_settings = {
    "task": 'forecasting',
    "primary_metric": 'normalized_root_mean_squared_error',
    "allowed_models": ts_models,
    "iteration_timeout_minutes": 10, # This needs to be changed based on the dataset
    "iterations": 15,
    "experiment_timeout_hours": 0.3,
    "label_column_name": target_column_name,
    "n_cross_validations": 3,
    "verbosity": logging.INFO,
    "debug_log": 'autoML_manyModels.txt',
    "time_column_name": time_column_name,
    "forecast_horizon": forecast_horizon,
    "freq": freq,
    "track_child_runs": False,
    "partition_column_names": time_series_id_column_names,
    "time_series_id_column_names": time_series_id_column_names,
    "pipeline_fetch_max_batch_size": 15
}
#AutoMLPipelineBuilder is used to build the many models train step
train_steps = AutoMLPipelineBuilder.get_many_models_train_steps(
    experiment=experiment,
    automl_settings=automl_settings,
    train_data=train_sales_data,
    compute_target=compute_target,
    node_count=2,
    process_count_per_node=8,
    run_invocation_timeout=3700,
    partition_column_names=time_series_id_column_names,
    output_datastore=ds)
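One conversion that might be relevant to the input-type exception: a dataset's `as_named_input` method wraps it in a `DatasetConsumptionConfig`, which I believe is among the input types a parallel run step accepts. Whether `get_many_models_train_steps` accepts that wrapper for `train_data` is an assumption I have not verified. The sketch below uses a hypothetical helper, `as_step_input`, that wraps an object only if it exposes `as_named_input`:

```python
def as_step_input(dataset, name):
    """Wrap a dataset as a named pipeline input when it supports it.

    Objects without an `as_named_input` method (already-wrapped inputs,
    for example) are returned unchanged.
    """
    wrap = getattr(dataset, "as_named_input", None)
    return wrap(name) if callable(wrap) else dataset


# Usage in the notebook (assumes train_sales_data is the FileDataset above;
# the input name is an arbitrary label):
# train_input = as_step_input(train_sales_data, "train_sales_data")
# print(type(train_input))  # compare against the types named in the exception
```

Printing the resulting type makes it easy to compare against the ALLOWED_INPUT_TYPES listed in the exception before passing it to the builder.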
I've also tried using a TabularDataset instead, but that raises the following AttributeError:
    281         # TODO: Merge these two in better fashion once tabular dataset is released to public.
    282         if(dataset_type == "<class 'azureml.data.tabular_dataset.TabularDataset'>"):
--> 283             parallel_run_config = ParallelRunConfig.create_with_partition_column_names(
    284                 source_directory=PROJECT_DIR,
    285                 entry_script='many_models_train_driver.py',
AttributeError: type object 'ParallelRunConfig' has no attribute 'create_with_partition_column_names'
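The AttributeError suggests that the installed `ParallelRunConfig` class does not expose `create_with_partition_column_names`, which the contrib package is calling. A minimal sketch to confirm that directly, assuming `ParallelRunConfig` is importable from `azureml.pipeline.steps`; `class_has_attribute` is a hypothetical helper, not part of the SDK:

```python
import importlib


def class_has_attribute(module_name, class_name, attribute):
    """Return True/False if the class is importable, or None if it is not."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    cls = getattr(module, class_name, None)
    if cls is None:
        return None
    return hasattr(cls, attribute)


# Prints True or False when the SDK is installed, None when it is not.
print(class_has_attribute(
    "azureml.pipeline.steps",
    "ParallelRunConfig",
    "create_with_partition_column_names",
))
```

A result of False would point at the installed pipeline steps package rather than at how the dataset was registered.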
Hi,
I've been following the AutoML Training Pipeline notebook for direction on implementing a many models forecasting solution.
Step Input Exception