hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

Language: consistency: retrieve dynamically from HTML where possible. #1112

Closed jayaddison closed 6 days ago

jayaddison commented 2 months ago

...and raise an ElementNotFoundInHtml exception to try to help us resolve cases where it isn't possible.

cc @rmdluo @mlduff because we've been discussing how to handle and emit warning messages.

jknndy commented 1 month ago

This returns a lengthy output during a run of python -m unittest , is this the expected behavior?

(.venv) % python -m unittest
....................................../recipe_scrapers/bestrecipes.py:19: UserWarning: bestrecipes.com.au doesn't seem to provide language metadata in their HTML. Please let us know if it becomes available in a standard location, and then we can try to retrieve it dynamically.
  warnings.warn(msg)
................................................................................................................................................................................................................................./recipe_scrapers/potatorolls.py:52: UserWarning: potatorolls.com appears to write language metadata into HTML DOCTYPE declaration instead of the top-level 'html' element or elsewhere. Please let us know if it becomes available in a standard location, and then we can try to retrieve it dynamically.
  warnings.warn(msg)
........................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 559 tests in 12.732s

OK
jayaddison commented 1 month ago

It's kinda-expected that it causes noise in the unittest output, yep. I'd wondered about that in another pull request, #1067, when discussing options for logging warnings for statically-defined fields (fields that only ever return a single value).

I had an idea and have pushed a commit (ff411e14e067fed809b18a2fa9be20622c3e174f) to implement it: we could ignore warnings from the scraper fields by default, but allow unit tests to be run with warning filters enabled, and add failures in the latter case when warnings occur.

Two drawbacks to that, though: it only works for Python3.11+, and also it means that the warnings are entirely hidden (no hints/reminders about them at all) when no Python warning options are configured.

jayaddison commented 1 month ago

Defaults

$ python -m unittest -k potato 
..
----------------------------------------------------------------------
Ran 2 tests in 0.172s

OK

Warnings enabled

$ python -Walways -m unittest -k potato 
FF
======================================================================
FAIL: tests/test_data/potatorolls.com/potatorolls_1.json (tests.RecipeTestCase.tests/test_data/potatorolls.com/potatorolls_1.json) [language]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jka/Documents/reciperadar/recipe-scrapers/tests/__init__.py", line 110, in test_func
    self.fail(msg=msg)
AssertionError: .language() emitted warnings: UserWarning

======================================================================
FAIL: tests/test_data/potatorolls.com/potatorolls_2.json (tests.RecipeTestCase.tests/test_data/potatorolls.com/potatorolls_2.json) [language]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jka/Documents/reciperadar/recipe-scrapers/tests/__init__.py", line 110, in test_func
    self.fail(msg=msg)
AssertionError: .language() emitted warnings: UserWarning

----------------------------------------------------------------------
Ran 2 tests in 0.178s

FAILED (failures=2)
jayaddison commented 6 days ago

Although the warnings-related discussion is valuable, I think it's better to narrow this changeset down to only the one change that it claims to make, and so I've reverted my modifications to add warning-filtering/handling in the unit tests.

jayaddison commented 6 days ago

Instead of attempting to add clever self-adaptive warning filtering logic within the codebase itself, I think that following common usage patterns for Python warning filtering would make more sense.

Typically the -W argument to the Python interpreter, and/or the PYTHONWARNINGS environment variable are provided for that purpose.

In particular, I think that something along the lines of PYTHONWARNINGS='ignore::Warning:recipe_scrapers.plugins.static_values' environment variable would be an example of using standard Python functionality to filter out only the static-value-related warnings when our unit tests run.