Closed: shuaichenchang closed this 1 year ago
Modified the README and renamed files (sql_evaluation.py => evaluation.py, sql_evaluation_test.py => evaluation_test.py) to match the code in the third-party repo. We originally used different file names for historical reasons; we have now changed them back to match the files in the original repo.
Changed the test criteria in meta_eval_wmt_da_test.py to make it pass. Use the criterion ci[0] < val < ci[1] instead of hardcoding the values of ci[0] and ci[1], e.g. change

```python
self.assertAlmostEqual(ci[0], 0.6488, 2)
self.assertAlmostEqual(ci[1], 0.8999, 2)
```

to

```python
self.assertGreater(val, ci[0])
self.assertGreater(ci[1], val)
```
Thanks @shuaichenchang ! Could we handle https://github.com/neulab/ExplainaBoard/pull/532/commits/3d5dfeca91c46e6e4faceb0b8a2d18154c7923ad as a separate PR, as it's not related to this addition of third-party code?
Maybe approve it this time and keep the habit in mind for future PRs.
I'm OK with discussing as part of this thread this time. But I'm actually not sure if the changes to the tests are appropriate, and was hoping to have discussion in the separate thread.
These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp : I think this is code that you implemented, could you confirm that?
If the code is not broken and it's just the case that the confidence intervals vary run-to-run due to some variance in bootstrapping, then I think it would be better to modify the tests so that they cover a "reasonable" range (e.g. so that they'll fail only a tiny fraction of the time) but not loosen them so much so that they're basically not checking the underlying code.
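For example, one hedged sketch of such a "reasonable range" check (the helper name, tolerance, and expected endpoint values here are illustrative, not taken from the actual test):

```python
# Illustrative sketch: tolerate run-to-run bootstrap variance, but still
# anchor the CI endpoints near previously observed values so the test
# keeps exercising the underlying bootstrap code.
def check_ci_reasonable(ci, val, expected_low=0.6488, expected_high=0.8999, tol=0.05):
    # The interval must still bracket the point estimate...
    assert ci[0] < val < ci[1]
    # ...and its endpoints should stay within +/- tol of values seen in
    # earlier runs (tol is a hypothetical tuning knob here).
    assert abs(ci[0] - expected_low) < tol
    assert abs(ci[1] - expected_high) < tol
```

Tuning tol so the test fails only a tiny fraction of the time would keep it meaningful without hardcoding exact values.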
"These changes to the unit tests seem to be loosening the check on confidence intervals significantly. I'm wondering if it's appropriate to do so, or if the tests are failing because the code is broken and we need to fix the underlying code. @pfliu-nlp : I think this is code that you implemented, could you confirm that?"
Yes, @shuaichenchang has consulted me offline; the current modified implementation is something I suggested.
"not loosen them so much so that they're basically not checking the underlying code."
I think the current implementation is fine. In a lot of other places, we don't even check the CI at all.
Thanks for the suggestion, which looks good.
"Are you certain that the implementation is correct, and the failed tests here are just a result of variance in bootstrapping or something similar? If so, then I'd be OK with approving this PR, but suggest that we at least leave a TODO there so we know take a note of this discussion for the future"
Yeah, I cannot think of other potential reasons (though that's just from my side). I think having a simple solution first, plus creating an issue or TODO, is sometimes a good idea! Let's go with it.
Additionally, some other context: this task is relatively challenging to add, so I tend to try to reduce blockers so that @shuaichenchang finds it relatively easy to add. If there are places that could be further improved, we can leave them as issues, and I, for example, could refine them later.
Yeah, I understand the importance of being pragmatic. I created an issue to double-check this though: https://github.com/neulab/ExplainaBoard/issues/537
A difference of 0.08 in the confidence values seems a bit bigger than I would expect from our bootstrapping so it'd be nice to know for sure that this is not a bug.
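As a rough, standalone way to gauge how much percentile-bootstrap CI endpoints should drift between runs (this is a generic sketch, not the ExplainaBoard implementation):

```python
import random
import statistics

def bootstrap_ci(data, n_resamples=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the mean of `data` (generic sketch)."""
    rng = rng or random.Random()
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Fixed sample; only the bootstrap resampling seed varies between "runs".
rng = random.Random(0)
data = [rng.gauss(0.77, 0.3) for _ in range(100)]
cis = [bootstrap_ci(data, rng=random.Random(seed)) for seed in range(10)]
drift = max(c[0] for c in cis) - min(c[0] for c in cis)
# With 1000 resamples, the lower-endpoint drift between runs is typically
# on the order of 0.01 or less for data like this, well under 0.08.
```

If the real tests see 0.08 of drift, something beyond resampling noise (e.g. a data or implementation difference) seems more likely.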
One observation: the test passes on Python 3.9 but fails on Python 3.10.
@neubig @pfliu-nlp Now that the issue has been created, let's focus on the original change in this PR.
@shuaichenchang Could you re-request reviews from @neubig and @odashi from the top-right of this page? It helps me continue the review as quickly as possible.
It seems that ignoring third-party in .markdownlint.yaml doesn't skip the checks on files in the third-party folder. @tetsuok
https://github.com/neulab/ExplainaBoard/blob/main/.markdownlint.yaml#L27
@shuaichenchang It seems the configuration file needs to be updated. Let me take a look. I'll fix it in a separate PR.
@shuaichenchang The fix was merged into main. You can merge main into this branch to fix the markdownlint error.
@shuaichenchang @neubig @pfliu-nlp I created a separate PR #550 to fix integration_tests/meta_eval_wmt_da_test.py as soon as possible, because I'm observing more frequent test failures in other pull requests, which is becoming a drain on our productivity.
I tested this branch locally and it passes all checks; it's strange that there is a markdownlint error here. Maybe it needs a re-commit? (cc @tetsuok)
```
black....................................................................Passed
flake8...................................................................Passed
docstring................................................................Passed
isort....................................................................Passed
Upgrade type hints.......................................................Passed
mypy.....................................................................Passed
markdownlint-cli2........................................................Passed
markdownlint-cli2-fix....................................................Passed
```
@pfliu-nlp It's because this branch is behind main. We need to merge main into this branch and push the commit to the remote.
@tetsuok I see (I assumed that @shuaichenchang had merged the latest main into this PR, but it seems he didn't); let me do it.
Add third-party code for text-to-sql evaluation. The code is adapted from https://github.com/taoyds/test-suite-sql-eval