mitmedialab / sherlock-project

This repository provides data and scripts to use Sherlock, a DL-based model for semantic data type detection: https://sherlock.media.mit.edu.
MIT License

KeyError when running model.predict(X_test) in 02-1-train-and-test-sherlock.ipynb #48

Status: Open. KentonParton opened this issue 2 years ago

KentonParton commented 2 years ago

Hello!

I am trying to use the pre-built 'sherlock' model to make predictions. As suggested in the README, I ran some of the cells in 02-1-train-and-test-sherlock.ipynb, but I get a KeyError when model.predict(X_test) runs.

Code to Reproduce:

model_id = 'sherlock'

from ast import literal_eval
from collections import Counter
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn.metrics import f1_score, classification_report

from sherlock.deploy.model import SherlockModel

start = datetime.now()
print(f'Started at {start}')

X_test = pd.read_parquet('../data/processed/X_test.parquet')
y_test = pd.read_parquet('../data/raw/test_labels.parquet').values.flatten()

y_test = np.array([x.lower() for x in y_test])

print(f'Finished at {datetime.now()}, took {datetime.now() - start} seconds')

start = datetime.now()
print(f'Started at {start}')

model = SherlockModel();
model.initialize_model_from_json(with_weights=True, model_id="sherlock");

print('Initialized model.')
print(f'Finished at {datetime.now()}, took {datetime.now() - start} seconds')

predicted_labels = model.predict(X_test)
predicted_labels = np.array([x.lower() for x in predicted_labels])

When model.predict(X_test) is run, the following KeyError occurs:

KeyError                                  Traceback (most recent call last)
/var/folders/66/cbb21km104n7d7t9qf61q8rmrsjdc8/T/ipykernel_21846/2316637303.py in <module>
----> 1 predicted_labels = model.predict(X_test)
      2 predicted_labels = np.array([x.lower() for x in predicted_labels])

~/ebsco_repos/sherlock-project/sherlock/deploy/model.py in predict(self, X, model_id)
    118         Array with predictions for X.
    119         """
--> 120         y_pred = self.predict_proba(X, model_id)
    121         y_pred_classes = helpers._proba_to_classes(y_pred, model_id)
    122 

~/ebsco_repos/sherlock-project/sherlock/deploy/model.py in predict_proba(self, X, model_id)
    141         y_pred = self.model.predict(
    142             [
--> 143                 X[feature_cols_dict["char"]].values,
    144                 X[feature_cols_dict["word"]].values,
    145                 X[feature_cols_dict["par"]].values,

KeyError: "['n_[^]-agg-sum', 'n_[^]-agg-max', 'n_[\\\\]-agg-kurtosis', 'n_[^]-agg-var', 'n_[\\\\]-agg-median', 'n_[^]-agg-kurtosis', 'n_[\\\\]-agg-mean', 'n_[\\\\]-agg-all', 'n_[^]-agg-min', 'n_[\\\\]-agg-sum', 'n_[^]-agg-median', 'n_[^]-agg-mean', 'n_[^]-agg-all', 'n_[\\\\]-agg-min', 'n_[\\\\]-agg-max', 'n_[^]-agg-any', 'n_[\\\\]-agg-var', 'n_[\\\\]-agg-any', 'n_[^]-agg-skewness', 'n_[\\\\]-agg-skewness'] not in index"
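A note on reading the message above: the four backslashes in names such as 'n_[\\\\]-agg-sum' are most likely an escaping artifact, and the underlying column name probably contains a single backslash. The sketch below reproduces the effect in plain Python. It assumes the message is built by formatting a list of missing names into a string and passing that string to KeyError; the exact way pandas builds it may differ between versions, and the variable names here are illustrative only.

```python
# Sketch: how a column name with ONE backslash can be displayed with FOUR.
# Assumption: the error message is a formatted list of the missing names.

name = "n_[\\]-agg-sum"  # the actual name holds a single backslash character

# Formatting a list uses repr() on each element, which doubles the backslash.
message = f"{[name]} not in index"

# str() of a KeyError with one argument is the repr() of that argument,
# which doubles every backslash once more.
rendered = str(KeyError(message))

print(name.count("\\"))      # backslashes in the real column name
print(rendered.count("\\"))  # backslashes shown in the traceback text
print(rendered)
```

So when comparing against the columns of X_test, the names to look for are of the form n_[\]-agg-sum and n_[^]-agg-sum.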

Is there something that I am missing or need to do prior to running the above code?

Appreciate the help!

KentonParton commented 2 years ago

@lowecg @madelonhulsebos would you mind providing some guidance, please?

lowecg commented 2 years ago

Hi Kenton,

Sorry for the delay - I missed your original post. I'll have a look at this in the morning.

To get a lay of the land:

It sounds like you've initialised the project and just run 02-1-train-and-test-sherlock.ipynb? Was that all you ran?

Could you confirm what version of Python you're running?

Cheers,

Chris.

madelonhulsebos commented 2 years ago

Hi @KentonParton,

Apologies for my late response; I plan to take a look at this tomorrow.

@lowecg, I believe I’ve encountered this issue before, but I will let you know if it turns out to be new to me.

Best, Madelon

madelonhulsebos commented 2 years ago

Hi @KentonParton,

I ran your code, and it works for me when I use the test data file created by running the notebook 01-data-processing.ipynb (the file is named test.parquet). Did you generate X_test.parquet with that notebook as well? What does it contain? Its head should look as follows:

[Screenshot, 2022-04-23: head of the expected test DataFrame; image not included]

If you just want to test the model with some custom input, I recommend using the notebook: 00-use-sherlock-out-of-the-box.ipynb.
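To answer the "what does it contain?" question for a given feature file, one option is to compare its column names against the names the model expects. The following is a minimal sketch, not part of the Sherlock code base: `missing_columns` is a hypothetical helper, and the example names are made up for illustration. How to obtain the model's expected feature names is not shown here.

```python
from typing import Iterable, List


def missing_columns(expected: Iterable[str], actual: Iterable[str]) -> List[str]:
    """Return the names in `expected` that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]


# Hypothetical example values, not taken from the real feature files.
expected = ["n_[\\]-agg-sum", "n_[^]-agg-sum", "n_[a]-agg-sum"]
actual = ["n_[a]-agg-sum", "n_[b]-agg-sum"]

# Lists every expected feature name that the data does not provide.
print(missing_columns(expected, actual))
```

With a pandas DataFrame, `actual` could simply be `X_test.columns`. A non-empty result would point to a feature file produced by a different processing step or version than the one the model was trained on.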

madelonhulsebos commented 2 years ago

Hi @KentonParton, did you solve the issue?