Hey @ivsanro1! It seems in most cases the support is 1. What do you think about considering those as test cases instead? It seems these examples are very similar to the existing test cases we have (https://github.com/scrapinghub/price-parser/blob/master/tests/test_price_parsing.py), but with more variety in currencies.
My main worry about these things is the workflow. If we go with an evaluation approach, we should figure out how contributors should be using it. Probably we should run it on CI. But it's less clear to me what to do with the results.
> What do you think about considering those as test cases instead? It seems these examples are very similar to the existing test cases we have
Hey @kmike, thanks for the suggestion! I also thought about this, but how would we handle the failing cases? Should we add them in a different test and mark them as xfail? I think it would be nice to be able to easily add cases that we know we currently fail, until they're fixed.
> If we go with an evaluation approach, we should figure out how contributors should be using it
Good point. I also think putting it in CI would be the best option if we decide to go with the evaluation and not move it to tests. My original idea was to use it merely as info, or just to check that the global accuracy didn't go down with each contribution - but that might be overkill. However, I like the tests idea more, and maybe I'll just add the failing cases in a different test marked as xfail until they get fixed. What do you think?
> However, I like the tests idea more, and maybe I'll just add the failing cases in a different test marked as xfail until they get fixed. What do you think?
This sounds good! That's exactly how the current testing suite works: each example is a separate test (via `pytest.mark.parametrize`), and the failing ones are marked as `xfail`.
The existing testing suite was actually created in a similar way: a few hundred example input strings from a dataset were annotated and converted to tests. Then there was a round of improving the parsing rules, and after that another set of examples was added to control for overfitting (mostly to get an impression; it's not very scientific).
Unlike with standard ML problems, I'm not too worried about overfitting here. I.e. it seems fine to fix xfail tests one by one, as in most cases it'd just mean adding more aliases for currencies. It still could be good to do a second round afterwards, e.g. to confirm that the accuracy improves after the first round of fixes.
By the way, the current testing suite contains a lot of overlapping or similar examples; that's because many websites show prices in a similar way. So it's more like a dataset than a hand-crafted testing suite, with a few exceptions. This allows us to make trade-off decisions: it could be that a rule fixes some examples and breaks others; we can use the test count to help us decide which way to go in such situations.
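For illustration, a minimal sketch of the parametrize + xfail pattern described above (hypothetical examples, not the actual layout of `tests/test_price_parsing.py`):

```python
import pytest
from price_parser import Price

# Hypothetical examples for illustration only.
CURRENCY_EXAMPLES = [
    ("22,90 €", "€"),
    ("$1,199.00", "$"),
    # a known-failing case stays xfail until the parsing rules handle it
    pytest.param("precio al cambio: 1.200 pesos", "pesos",
                 marks=pytest.mark.xfail),
]

@pytest.mark.parametrize("price_string, expected_currency", CURRENCY_EXAMPLES)
def test_currency_extraction(price_string, expected_currency):
    assert Price.fromstring(price_string).currency == expected_currency
```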
> That's exactly how the current testing suite works: each example is a separate test (via `pytest.mark.parametrize`), and the failing ones are marked as `xfail`.
I see! Thanks for the explanation :+1: Then, instead of adding this evaluation script, I'll add all the examples as tests and put those that fail in `PRICE_PARSING_EXAMPLES_XFAIL` until they're fixed. I prefer this method since it's easier to document each of the cases in the tests.
Please feel free to close the PR if you consider it appropriate.
You can also implement the new approach in this PR if you want, instead of creating a separate one.
Those were the last tests I wanted to add for this PR. I was also thinking of using this PR to implement these cases and then move the tests from xfail to pass, but it might take some time. Should we merge it and add the cases to price-parser in another PR, @kmike @Gallaecio?
> Should we merge it and add the cases to price-parser in another PR?
Sounds good to me.
Before merging, could you please ensure that all tests pass @ivsanro1? I think some of them fail currently.
> Before merging, could you please ensure that all tests pass @ivsanro1?
Yes, of course! I made the changes in 7529b2e52ceec51fe5d98b65ab8223cfcfad25be.
These mypy errors were showing:
```
price_parser/_currencies.py:30: error: syntax error in type comment
price_parser/_currencies.py:1731: error: syntax error in type comment
price_parser/_currencies.py:1732: error: syntax error in type comment
price_parser/_currencies.py:1733: error: syntax error in type comment
```
I didn't see any particular problems with these lines. I saw we could update mypy from version `0.761` to `0.982`, and after updating it, it showed a couple of type errors in the tests I added (now fixed).
Also, I added the `-v` flag to mypy in `tox.ini` so it would report the errors (it wasn't showing them before).
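For reference, the relevant part of `tox.ini` might look roughly like this after the change (a sketch with assumed environment names and paths, not the exact contents of the repository's `tox.ini`):

```ini
# Hypothetical sketch of the mypy environment; names and paths are assumptions.
[testenv:mypy]
deps =
    mypy==0.982
commands =
    mypy -v price_parser tests
```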
Now everything is passing:
```
Success: no issues found in 5 source files
_________________________________________________________________________ summary _________________________________________________________________________
py36: commands succeeded
py37: commands succeeded
py38: commands succeeded
mypy: commands succeeded
congratulations :)
```
It may make sense to look into addressing https://github.com/scrapinghub/price-parser/issues/52 before merging here.
Thanks @ivsanro1!
Hello,
Since we might add some features for fuzzy search, especially when the currency is not found (see https://github.com/scrapinghub/price-parser/issues/28#issuecomment-1274568097), we think it'd be nice to have an evaluation script and dataset that we can use to compare the quality of the extractions (thanks for the idea, @lopuhin!).
The purpose of the evaluation is to compare the currency extraction quality of `Price.fromstring`. If the functionality of this function is extended via a PR (e.g. we perform a fuzzy search if the currency is not found), we should be able to see an improvement in the metrics. I decided not to add it as a test because the evaluation metrics can be interpreted in several ways.

For now, the metrics are simply global accuracy and accuracy per symbol (ground truth) for quality evaluation, and total extraction time and extraction time per sample for performance evaluation. These metrics can be easily extended, as the script is quite simple (`evaluation/evaluate_currency_extraction.py`). If in the future we add other evaluation metrics or datasets, this script can be improved and generalized (e.g. by adding `argparse`). For now I decided to keep it simple.

The dataset (`dataset_eval.json`) can also be easily extended to include more cases. Here's a sample of `dataset_eval.json`:

The evaluation script will extract the currency from `string` using `Price.fromstring` and compare it with `currency`; a correct extraction happens when both are equal. With the current dataset and price_parser versions, this is the output of the evaluation:
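For reference, here is a minimal sketch of the evaluation idea described above. The field names (`string`, `currency`), the dataset path, and the metric loop are assumptions based on this description, not the actual contents of `evaluation/evaluate_currency_extraction.py`:

```python
# Sketch only: assumes dataset_eval.json is a list of {"string": ..., "currency": ...} records.
import json
import time
from collections import defaultdict

from price_parser import Price

with open("dataset_eval.json") as f:
    dataset = json.load(f)

correct = 0
per_symbol = defaultdict(lambda: [0, 0])  # ground-truth symbol -> [correct, total]

start = time.perf_counter()
for sample in dataset:
    predicted = Price.fromstring(sample["string"]).currency
    expected = sample["currency"]
    per_symbol[expected][1] += 1
    if predicted == expected:
        correct += 1
        per_symbol[expected][0] += 1
elapsed = time.perf_counter() - start

print(f"Global accuracy: {correct / len(dataset):.3f}")
print(f"Total time: {elapsed:.3f}s ({1000 * elapsed / len(dataset):.3f} ms/sample)")
for symbol, (ok, total) in sorted(per_symbol.items()):
    print(f"  {symbol}: {ok}/{total} correct")
```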
I'm open to suggestions.