neulab / ExplainaBoard

Interpretable Evaluation for AI Systems
MIT License

add third party code for text-to-sql evaluation #532

Closed shuaichenchang closed 1 year ago

shuaichenchang commented 1 year ago

Add third-party code for text-to-sql evaluation. The code is adapted from https://github.com/taoyds/test-suite-sql-eval

shuaichenchang commented 1 year ago

Modified the README and renamed the files sql_evaluation.py => evaluation.py and sql_evaluation_test.py => evaluation_test.py to match the code in the third-party repo. We had used different file names for historical reasons; now we have changed them back to match the files in the original repo.

shuaichenchang commented 1 year ago

Changed the test criteria in meta_eval_wmt_da_test.py to make the test pass. Use the criterion ci[0] < val < ci[1] instead of hardcoding the values of ci[0] and ci[1], e.g. change

self.assertAlmostEqual(ci[0], 0.6488, 2)
self.assertAlmostEqual(ci[1], 0.8999, 2)

to

self.assertGreater(val, ci[0]) 
self.assertGreater(ci[1], val) 
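The revised criterion can be sketched as a self-contained unittest (the numeric values here are hypothetical, not taken from meta_eval_wmt_da_test.py):

```python
import unittest

class CIBoundsTest(unittest.TestCase):
    # Illustrative sketch of the looser criterion: assert that the point
    # estimate falls strictly inside the bootstrap confidence interval,
    # rather than pinning the CI endpoints to hardcoded values.
    def test_value_within_ci(self):
        val = 0.7533            # hypothetical point estimate
        ci = (0.6488, 0.8999)   # hypothetical confidence interval
        self.assertGreater(val, ci[0])  # val above the lower bound
        self.assertGreater(ci[1], val)  # val below the upper bound
```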
neubig commented 1 year ago

Thanks @shuaichenchang ! Could we handle https://github.com/neulab/ExplainaBoard/pull/532/commits/3d5dfeca91c46e6e4faceb0b8a2d18154c7923ad as a separate PR, as it's not related to this addition of third-party code?

pfliu-nlp commented 1 year ago

Thanks @shuaichenchang ! Could we handle 3d5dfec as a separate PR, as it's not related to this addition of third-party code?

Maybe we can approve it in this case and keep the habit in mind for future PRs.

neubig commented 1 year ago

I'm OK with discussing as part of this thread this time. But I'm actually not sure if the changes to the tests are appropriate, and was hoping to have discussion in the separate thread.

These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp : I think this is code that you implemented, could you confirm that?

If the code is not broken and it's just the case that the confidence intervals vary run-to-run due to some variance in bootstrapping, then I think it would be better to modify the tests so that they cover a "reasonable" range (e.g. so that they'll fail only a tiny fraction of the time) but not loosen them so much so that they're basically not checking the underlying code.
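One way to realize such a "reasonable range" check, assuming the CI comes from a percentile bootstrap, is to fix the random seed so the interval is reproducible and then assert the endpoints land in a band wide enough to absorb resampling variance but tight enough to catch real bugs. This is a hypothetical sketch; bootstrap_ci and the data below are not ExplainaBoard's actual code:

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=12345):
    # Percentile-bootstrap CI of the mean; seeding the RNG makes the
    # test reproducible across runs and Python versions.
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# "Reasonable range" check: generous bounds that bootstrap variance
# cannot plausibly violate, so the test fails only on a genuine bug.
data = [0.6, 0.7, 0.8, 0.75, 0.9, 0.65, 0.85, 0.7, 0.8, 0.75]
lo, hi = bootstrap_ci(data)
assert 0.6 < lo < 0.75 and 0.75 < hi < 0.9
```

With a fixed seed the interval is deterministic, so the band can be calibrated from a few observed runs while still tripping on a regression in the CI computation.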

pfliu-nlp commented 1 year ago

"These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp : I think this is code that you implemented, could you confirm that?"

Yes, @shuaichenchang has consulted me offline; the current modified implementation is something I suggested.

"not loosen them so much so that they're basically not checking the underlying code."

I think the current implementation is fine. In a lot of other places we don't even check the CI at all.

pfliu-nlp commented 1 year ago

Thanks for the suggestion, which looks good.

"Are you certain that the implementation is correct, and the failed tests here are just a result of variance in bootstrapping or something similar? If so, then I'd be OK with approving this PR, but suggest that we at least leave a TODO there so we know take a note of this discussion for the future"

Yeah, I cannot think of other potential reasons (but that's just from my side). I think having a simple solution first, plus creating an issue or TODO, is sometimes a good idea! Let's go with it.

Additionally, some context: this task is relatively challenging to add, so I tend to try to reduce blockers to make it easier for @shuaichenchang to add it. If there are places that could be further improved, we can leave them as issues, and I, for example, could refine them later.

neubig commented 1 year ago

Yeah, I understand the importance of being pragmatic. I created an issue to double-check this though: https://github.com/neulab/ExplainaBoard/issues/537

A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping so it'd be nice to know for sure that this is not a bug.

pfliu-nlp commented 1 year ago

"Yeah, I understand the importance of being pragmatic. I created an issue to double-check this though: #537

A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping so it'd be nice to know for sure that this is not a bug."

One observation: it passes the test in Python 3.9 but fails in Python 3.10.

odashi commented 1 year ago

@neubig @pfliu-nlp Now that the issue has been created, we can focus on the original change in this PR.

@shuaichenchang Could you re-request reviews from @neubig and @odashi at the top right of this page? That helps me continue the review as quickly as possible.

shuaichenchang commented 1 year ago

It seems that ignoring third_party in .markdownlint.yaml doesn't skip the checks on files in the third-party folder. @tetsuok https://github.com/neulab/ExplainaBoard/blob/main/.markdownlint.yaml#L27
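For context, markdownlint-cli2 reads its file-selection settings (including `ignores`) from a `.markdownlint-cli2.*` configuration file, while `.markdownlint.yaml` configures only the lint rules, which may explain why the entry there has no effect. A hypothetical `.markdownlint-cli2.yaml` entry might look like (the glob below is an assumption, not the repo's actual configuration):

```yaml
# Hypothetical .markdownlint-cli2.yaml; the path glob is an assumption.
ignores:
  - "explainaboard/third_party/**"
```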

tetsuok commented 1 year ago

@shuaichenchang It seems the configuration file needs to be updated. Let me take a look. I'll fix it in a separate PR.

tetsuok commented 1 year ago

@shuaichenchang The fix was merged into main. You can merge main into this branch to fix the markdownlint error.

tetsuok commented 1 year ago

@shuaichenchang @neubig @pfliu-nlp I created a separate PR #550 to fix integration_tests/meta_eval_wmt_da_test.py as soon as possible, because I'm observing more frequent test failures in other pull requests, which is becoming a drain on our productivity.

pfliu-nlp commented 1 year ago

I tested this branch locally and it passes all tests; it's strange that there is a markdownlint error here. Maybe it needs to be re-committed? (cc @tetsuok)

black....................................................................Passed
flake8...................................................................Passed
docstring................................................................Passed
isort....................................................................Passed
Upgrade type hints.......................................................Passed
mypy.....................................................................Passed
markdownlint-cli2........................................................Passed
markdownlint-cli2-fix....................................................Passed
tetsuok commented 1 year ago

@pfliu-nlp It's because this branch is behind main. We need to merge main into this branch and push the commit to the remote.

pfliu-nlp commented 1 year ago

@tetsuok I see (I assumed that @shuaichenchang had merged the latest main into this PR, but it seems he hadn't); let me do it.