marcotcr / checklist

Beyond Accuracy: Behavioral Testing of NLP models with CheckList
MIT License

IndexError when running test #103

Closed · xesaad closed this issue 3 years ago

xesaad commented 3 years ago

Hello 👋🏼 firstly, thank you very much for making CheckList! It is very useful and well-documented.

I am receiving an IndexError when I try to run a test (following the tutorial on testing) using either the run or run_from_file methods. As it is essentially the same traceback in either case, here is the traceback when I try to run a test from a file:

IndexError                                Traceback (most recent call last)
/tmp/ipykernel_736/1947319734.py in <module>
----> 1 test.run_from_file('/tmp/softmax_preds.txt', file_format='softmax', overwrite=True)

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/abstract_test.py in run_from_file(self, path, file_format, format_fn, ignore_header, overwrite)
    341                                  format_fn=format_fn,
    342                                  ignore_header=ignore_header)
--> 343         self.run_from_preds_confs(preds, confs, overwrite=overwrite)
    344 
    345 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/abstract_test.py in run_from_preds_confs(self, preds, confs, overwrite)
    310         self._check_create_results(overwrite)
    311         self.update_results_from_preds(preds, confs)
--> 312         self.update_expect()
    313 
    314     def run_from_file(self, path, file_format=None, format_fn=None, ignore_header=False, overwrite=False):

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/abstract_test.py in update_expect(self)
    129     def update_expect(self):
    130         self._check_results()
--> 131         self.results.expect_results = self.expect(self)
    132         self.results.passed = Expect.aggregate(self.results.expect_results, self.agg_fn)
    133 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/expect.py in expect(self)
     76         def expect(self):
     77             zipped = iter_with_optional(self.data, self.results.preds, self.results.confs, self.labels, self.meta, self.run_idxs)
---> 78             return [fn(x, pred, confs, labels, meta) for x, pred, confs, labels, meta in zipped]
     79         return expect
     80 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/expect.py in <listcomp>(.0)
     76         def expect(self):
     77             zipped = iter_with_optional(self.data, self.results.preds, self.results.confs, self.labels, self.meta, self.run_idxs)
---> 78             return [fn(x, pred, confs, labels, meta) for x, pred, confs, labels, meta in zipped]
     79         return expect
     80 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/expect.py in expect_fn(xs, preds, confs, label, meta)
     96         """
     97         def expect_fn(xs, preds, confs, label=None, meta=None):
---> 98             return np.array([fn(x, p, c, l, m) for x, p, c,  l, m in iter_with_optional(xs, preds, confs, label, meta)])
     99         return Expect.testcase(expect_fn)#, agg_fn)
    100 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/expect.py in <listcomp>(.0)
     96         """
     97         def expect_fn(xs, preds, confs, label=None, meta=None):
---> 98             return np.array([fn(x, p, c, l, m) for x, p, c,  l, m in iter_with_optional(xs, preds, confs, label, meta)])
     99         return Expect.testcase(expect_fn)#, agg_fn)
    100 

/pyenv/versions/3.7.8/envs/seo-advice-page/lib/python3.7/site-packages/checklist/expect.py in ret_fn(x, pred, conf, label, meta)
    407             gt = val if val is not None else label
    408             softmax = type(conf) in [np.array, np.ndarray]
--> 409             conf = conf[gt] if softmax else -conf
    410             conf_viol = -(1 - conf)
    411             if pred == gt:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If it is of use, here are some additional details: the model I am testing is an aspect-based sentiment classifier, and my test cases are of the form 'This is not {a:pos} {mask}.|{aspect}', where the aspect is selected from a short list of aspects. My wrapper function for the test splits the string at the | character and runs the aspect-based prediction model. I can provide more details if required.

The error was not raised when I ran the example in the tutorial verbatim, so I expect this is related to my wrapper function or test cases.

Environment: Python 3.7.8 CheckList installed via pip (so version 0.0.11)

marcotcr commented 3 years ago

Can you give an example of input / output of your prediction function?

xesaad commented 3 years ago

> Can you give an example of input / output of your prediction function?

Example inputs (to the wrapper function) are strings consisting of a review and an "aspect" separated by |. For example, "this is a very comfortable chair.|comfort". The wrapper function outputs a pair (sentiment, certainty) of float values, where sentiment is a score between 0 and 1 (with 0 being negative and 1 being positive) and certainty is a float between 0 and 1 indicating how confident the model is with this prediction.

(Under the hood, the prediction wrapper splits the string into a pair ("this is a very comfortable chair.", "comfort") which is then passed to a model that computes the sentiment expressed in the text towards the aspect, as well as the certainty.)

With the wrapper function named prediction_wrapper, I followed the tutorial and defined the following function which is used to run tests (rescaling as the sentiment function used in the tutorial takes values between -1 and 1):

import numpy as np

def predict_proba(inputs):
    """
    Returns an (n, 2) array with probabilities for negative and positive.
    """
    # Column of positive-sentiment scores, one per input string.
    p1 = np.array([prediction_wrapper(x)[0] for x in inputs]).reshape(-1, 1)
    # Negative probability is the complement of the positive score.
    p0 = 1 - p1
    return np.hstack((p0, p1))

A few test cases generated by CheckList are below.

'This is not a good screen.|design'
'This is not an exciting analysis.|value'
'This is not an awful translation.|quality'
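
For what it's worth, a quick way to sanity-check the shape of `predict_proba`'s output is to run it with a stub in place of the real model. The `prediction_wrapper` below is a hypothetical stand-in (the real aspect-based classifier is not shown in this issue); only the `(sentiment, certainty)` return contract matches what I described above.

```python
import numpy as np

def prediction_wrapper(text):
    # Hypothetical stub: the real model is an aspect-based sentiment
    # classifier. Returns (sentiment, certainty), both floats in [0, 1].
    review, aspect = text.split("|")
    sentiment = 0.9 if "comfortable" in review else 0.2
    return sentiment, 0.8

def predict_proba(inputs):
    """Returns an (n, 2) array with probabilities for negative and positive."""
    p1 = np.array([prediction_wrapper(x)[0] for x in inputs]).reshape(-1, 1)
    p0 = 1 - p1
    return np.hstack((p0, p1))

probs = predict_proba(["this is a very comfortable chair.|comfort"])
print(probs.shape)  # (1, 2)
```

Each row is `[p_negative, p_positive]` and sums to 1, which is the `softmax` format the test runner expects.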
xesaad commented 3 years ago

I resolved this issue. I made an error when defining ret: namely, I set labels="positive" (resp. "negative") when I should have set labels=1 (resp. 0). This is because my model outputs a label (one of "positive", "negative" or "neutral"), and I had modified it to return the prediction as a float but forgot to update the labels to match.
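
The failure mode can be reproduced in isolation. As the traceback shows, ret_fn in expect.py does conf = conf[gt], i.e. it indexes the softmax confidence array with the label. That works when the label is an integer class index, but raises exactly this IndexError when the label is a string:

```python
import numpy as np

# Softmax confidences for [negative, positive], as produced by predict_proba.
conf = np.array([0.1, 0.9])

print(conf[1])  # integer label: works, gives the positive-class confidence

try:
    conf["positive"]  # string label, as in my misconfigured test
except IndexError as e:
    print("IndexError:", e)
```

So with softmax-format predictions, the labels passed to the test must be the integer class indices, not the class names.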

Closing this issue as resolved.