microsoft / solution-accelerator-many-models

MIT License

AutoMLPipelineBuilder: Exception on step input type of <class 'azureml.data.file_dataset.FileDataset'> #127

Closed petrenkw closed 3 years ago

petrenkw commented 3 years ago

Hi,

I've been following the AutoML Training Pipeline notebook for direction on implementing a many models forecasting solution.

I've registered my data as a FileDataset per instructions mentioned in the 01_Data_Preparation notebook. However, when I call the get_many_models_train_steps method an exception is thrown on step input type of <class 'azureml.data.file_dataset.FileDataset'>.

There doesn't appear to be any documentation stating that the dataset should be registered as one of the classes listed under ALLOWED_INPUT_TYPES in the exception, such as azureml.pipeline.core.pipeline_output_dataset.PipelineOutputFileDataset.
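One intermediate step I've been wondering about (this is just a guess on my part, I couldn't find it confirmed in the accelerator docs) is wrapping the FileDataset with as_named_input(), which returns a DatasetConsumptionConfig rather than the raw FileDataset:

# Sketch of a possible conversion -- my assumption, not confirmed anywhere in the docs:
# as_named_input() wraps the FileDataset in a DatasetConsumptionConfig, which is the kind
# of wrapped input I'd expect a pipeline step to accept instead of the raw FileDataset.
train_sales_data = Dataset.File.from_files(path=train_datastore_paths)
train_input = train_sales_data.as_named_input('train_sales_data')  # DatasetConsumptionConfig
# ...and then pass train_data=train_input to get_many_models_train_steps below.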

Below is the code that I'm using in an AzureML notebook, along with the exception that gets raised. Is there any intermediate processing that I'm missing that would convert the dataset to one of the allowed input types?

Any direction is greatly appreciated. Thanks!

# Keep azureml-core updated to the latest version
!pip install --upgrade azureml-core
# Install the azureml-contrib-automl-pipeline-steps package that is needed for many models
!pip install azureml-contrib-automl-pipeline-steps

#dependencies
import logging
import os
import random
import time
import json

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd
from datetime import datetime

import azureml.core
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset
from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
from azureml.contrib.automl.pipeline.steps import AutoMLPipelineBuilder
from azureml.automl.core.forecasting_parameters import ForecastingParameters

# Initialize workspace and remote compute
# (workspace_name, subscription_id, resource_group and workspace_region are set earlier in the notebook)

ws = Workspace.create(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      location = workspace_region,
                      exist_ok=True)

amlcompute_cluster_name = "cluster-{}".format(ws._workspace_id)[:10]
# Reuse the cluster if it already exists, otherwise provision a new one
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2', min_nodes=2, max_nodes=12)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

#set up experiment
experiment_name = 'automl-manymodels-salesforecast'
experiment = Experiment(ws, experiment_name)

#sourcing data as FileDataset
ds = ws.get_default_datastore()
train_datastore_paths = [(ds, 'azureml/ForecastsInput/train/')]
test_datastore_paths = [(ds, 'azureml/ForecastsInput/test/')]
train_sales_data = Dataset.File.from_files(path=train_datastore_paths)
test_sales_data = Dataset.File.from_files(path=test_datastore_paths)

smpl_view = pd.read_csv(train_sales_data.download()[0])
smpl_view.head(5) #this works as expected

#configuration parameters for forecasting task 
target_column_name = 'revenue'
time_column_name = 'month_date'
time_series_id_column_names = 'partner'
forecast_horizon = 3
freq='MS' #MonthBegin per pandas offset
ts_models = ['Naive', 'AutoArima', 'SeasonalAverage', 'SeasonalNaive', 'ExponentialSmoothing', 'Arimax', 'Average','Prophet']

automl_settings = {
    "task" : 'forecasting',
    "primary_metric" : 'normalized_root_mean_squared_error',
    "allowed_models": ts_models,
    "iteration_timeout_minutes" : 10, # This needs to be changed based on the dataset
    "iterations" : 15,
    "experiment_timeout_hours" : 0.3,
    "label_column_name" : target_column_name,
    "n_cross_validations" : 3,
    "verbosity" : logging.INFO, 
    "debug_log": 'autoML_manyModels.txt',
    "time_column_name": time_column_name,
    "forecast_horizon" : forecast_horizon,
    "freq": freq,
    "track_child_runs": False,
    "partition_column_names": time_series_id_column_names,
    "time_series_id_column_names": time_series_id_column_names,
    "pipeline_fetch_max_batch_size": 15
}

#AutoMLPipelineBuilder is used to build the many models train step
train_steps = AutoMLPipelineBuilder.get_many_models_train_steps(experiment=experiment,
                                                                automl_settings=automl_settings,
                                                                train_data=train_sales_data,
                                                                compute_target=compute_target,
                                                                node_count=2,
                                                                process_count_per_node=8,
                                                                run_invocation_timeout=3700,
                                                                partition_column_names=time_series_id_column_names,
                                                                output_datastore=ds)

[Screenshot: step input exception]

Also note that I've tried using a TabularDataset instead, but the attribute error below gets thrown:

    281         # TODO: Merge these two in better fashion once tabular dataset is released to public.
    282         if(dataset_type == "<class 'azureml.data.tabular_dataset.TabularDataset'>"):
--> 283             parallel_run_config = ParallelRunConfig.create_with_partition_column_names(
    284                 source_directory=PROJECT_DIR,
    285                 entry_script='many_models_train_driver.py',

AttributeError: type object 'ParallelRunConfig' has no attribute 'create_with_partition_column_names'
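
In case it's a version mismatch between the pipeline packages (again, just a guess on my part), this is how I checked the installed versions from the notebook:

# Quick diagnostic -- my assumption is that a missing
# ParallelRunConfig.create_with_partition_column_names points at an out-of-date
# azureml-pipeline-steps install relative to azureml-contrib-automl-pipeline-steps.
import pkg_resources

for pkg in ['azureml-core',
            'azureml-pipeline-steps',
            'azureml-contrib-automl-pipeline-steps']:
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')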

naveenkaushik2504 commented 3 years ago

I'm facing a similar issue. Is there a fix for it?