neulab / explainaboard_client

For tabular regression/classification, use all features by default? #56

Closed neubig closed 1 year ago

neubig commented 1 year ago

Currently, classification and regression over tabular data (extracted features) are supported through the tabular-regression and tabular-classification tasks. However, the processors for these tasks use essentially no input features for analysis by default.

Because of this, any features that you want to analyze need to be declared as custom features in a JSON file.
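For context, declaring those features by hand currently means writing a JSON file roughly along the lines sketched below. This is an illustrative sketch only: the field names follow the general shape of ExplainaBoard custom-feature declarations, not necessarily the exact schema.

import json

# Illustrative custom-feature declaration for two of the iris columns;
# exact field names are defined by ExplainaBoard and may differ.
custom_features = {
    "custom_features": {
        "example": {
            "sepal-length": {"cls_name": "Value", "dtype": "float",
                             "description": "sepal length in cm"},
            "sepal-width": {"cls_name": "Value", "dtype": "float",
                            "description": "sepal width in cm"},
        }
    }
}

# Write the declaration to a file for submission alongside the dataset
with open("custom_features.json", "w") as f:
    json.dump(custom_features, f, indent=2)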

It'd be nice to make this process as easy as possible. Here is an example of a front-end interface we could aim for, combining a standard scikit-learn workflow with the ExplainaBoard client:

The only additional thing that would need to be implemented is the explainaboard_client.wrap_tabular_dataset function.

# Import libraries and classes required for this example:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd 
import explainaboard_client

# Import dataset:
url = "iris.csv"

# Assign column names to dataset:
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Convert dataset to a pandas dataframe:
dataset = pd.read_csv(url, names=names) 

# Preview the first 5 rows:
print(dataset.head())
# Assign values to the X and y variables:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values 

# Split dataset into random train and test subsets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20) 

# Standardize features by removing mean and scaling to unit variance:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) 

# Use the KNN classifier to fit data:
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train) 

# Predict y data with classifier: 
y_predict = classifier.predict(X_test)

wrapped_test = explainaboard_client.wrap_tabular_dataset(
    X_test,
    y_test,
    column_names=names[:-1],
    columns_to_analyze=['sepal-length', 'sepal-width', 'petal-length', 'petal-width'],
)

# Initialize the ExplainaBoard client (assumes API credentials are
# configured as described in the explainaboard_client README)
client = explainaboard_client.ExplainaboardClient()

# Do the evaluation on the model's predictions
evaluation_result = client.evaluate_system(
    task='tabular-classification',
    system_name='tabular-classification-test',
    system_output=y_predict,
    custom_dataset=wrapped_test,
    split='test',
    source_language='en',
)

# Print the results
print(f'Successfully submitted system!\n'
      f'Name: {evaluation_result["system_name"]}\n'
      f'ID: {evaluation_result["system_id"]}')
results = evaluation_result['results']['example'].items()
for metric_name, value in results:
    print(f'{metric_name}: {value:.4f}')

noelchen90 commented 1 year ago

@neubig In the example above, X_test and y_test are of type numpy.ndarray. For the explainaboard_client.wrap_tabular_dataset function, was the input expected to be a numpy.ndarray or a pandas.DataFrame?

In the custom feature JSON example, most custom features are categorical, which wouldn't work well with a numpy.ndarray. So instead of passing in X_test and y_test, maybe we should let users pass in dataset (a pandas.DataFrame). What do you think?

wrapped_dataset = explainaboard_client.wrap_tabular_dataset(
    dataset,
    columns_to_analyze=['sepal-length', 'sepal-width', 'petal-length', 'petal-width'],
)
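To make the DataFrame-based interface concrete, here is a minimal, hypothetical sketch of what such a wrapper could do internally; this is not the actual client implementation, and all names besides pandas calls are illustrative.

import pandas as pd

# Hypothetical sketch of a DataFrame-based wrap_tabular_dataset
def wrap_tabular_dataset(dataset: pd.DataFrame,
                         columns_to_analyze: list,
                         label_column: str = None) -> dict:
    # Assume the last column holds the labels unless one is named explicitly
    label_column = label_column or dataset.columns[-1]

    # One dict per example, keeping the true label and the analyzed columns
    examples = []
    for _, row in dataset.iterrows():
        example = {'true_label': row[label_column]}
        for col in columns_to_analyze:
            example[col] = row[col]
        examples.append(example)

    # Infer a dtype for each analyzed column from its pandas dtype, so
    # categorical (string) features are handled as well as numeric ones
    custom_features = {
        col: {
            'dtype': 'float' if pd.api.types.is_numeric_dtype(dataset[col]) else 'string',
            'description': col,
        }
        for col in columns_to_analyze
    }
    return {'examples': examples, 'custom_features': custom_features}

With something along these lines, the wrapped_dataset call above would yield per-example feature dicts plus a feature declaration, so users wouldn't have to write the custom-features JSON by hand.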
neubig commented 1 year ago

Yep, that sounds great, thanks @noelchen90 !

neubig commented 1 year ago

Should be fixed by https://github.com/neulab/explainaboard_client/commit/d576898edb85d81887c7e8844bb20ffc03ff5f7f