Closed Bardo-Konrad closed 4 months ago
Sorry; this library can only scrape a single recipe at a time from the HTML retrieved from public recipe websites.
So if I separated the recipes it could work?
It's unlikely to work with generic PDF-to-HTML conversion software, but in theory it is possible, if the conversion software produces HTML that contains valid schema.org recipe metadata.
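To check whether converted HTML carries that metadata at all, a minimal sketch using only the Python standard library might look like the following. It assumes the metadata is embedded as JSON-LD in a `<script type="application/ld+json">` tag (schema.org data can also be expressed as microdata or RDFa, which this sketch does not cover). The names `JsonLdCollector`, `find_recipes`, `SAMPLE_HTML`, and `PLAIN_HTML` are illustrative and not part of recipe-scrapers.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipes(html):
    """Return every JSON-LD object in the page whose @type includes "Recipe"."""
    collector = JsonLdCollector()
    collector.feed(html)
    recipes = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata
        # The top level may be a single object, a list, or an {"@graph": [...]} wrapper.
        if isinstance(data, list):
            candidates = data
        elif isinstance(data, dict):
            candidates = data.get("@graph", [data])
        else:
            candidates = []
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                recipes.append(item)
    return recipes


# Illustrative page that embeds schema.org Recipe metadata.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Pancakes", "recipeIngredient": ["2 eggs", "200 g flour"]}
</script>
</head><body><h1>Pancakes</h1></body></html>
"""

# Illustrative page without any metadata, as a generic PDF converter might emit.
PLAIN_HTML = "<html><body><h1>Pancakes</h1><p>2 eggs, 200 g flour</p></body></html>"

if __name__ == "__main__":
    print(len(find_recipes(SAMPLE_HTML)), len(find_recipes(PLAIN_HTML)))
```

If `find_recipes` returns an empty list for the converted file, the HTML most likely has no JSON-LD recipe metadata for a schema-based scraper to read.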
That is not the case, for sure. Isn't there a regex part that detects recipes and extracts them in a robust way in the code?
There's no generic regex for that, no - the typical workflow is that the domain name from the recipe's URL is matched against our available scrapers, and if one is found, then it implements logic for each of the different fields (title, ingredients, prep_time, ...). Fortunately, many of them can look that data up directly in the schema.org metadata, and if not, they can navigate the HTML to find the relevant information.
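As a rough illustration of that workflow (not the library's actual code), the domain lookup could be sketched like this. `ExampleSiteScraper`, `SCRAPERS`, and `pick_scraper` are hypothetical names, and `example.com` is a placeholder host.

```python
from urllib.parse import urlparse


class ExampleSiteScraper:
    """Hypothetical per-site scraper; a real one implements many more fields."""

    def __init__(self, html):
        self.html = html

    def title(self):
        # A real scraper would read schema.org metadata or navigate the HTML here.
        return "placeholder title"


# Hypothetical registry mapping a host name to the scraper class that handles it.
SCRAPERS = {"example.com": ExampleSiteScraper}


def pick_scraper(url, html):
    """Select a scraper by the URL's host name, or fail if none is registered."""
    host = urlparse(url).hostname or ""
    host = host.removeprefix("www.")  # str.removeprefix needs Python 3.9+
    scraper_cls = SCRAPERS.get(host)
    if scraper_cls is None:
        raise LookupError(f"no scraper registered for {host!r}")
    return scraper_cls(html)


if __name__ == "__main__":
    scraper = pick_scraper("https://www.example.com/recipes/1", "<html></html>")
    print(scraper.title())
```

Under this kind of dispatch, HTML that arrives without a URL has no host name to match, so only a schema.org-based fallback could apply.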
How?
By using the BeautifulSoup Python library, in most cases.
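For a sense of what that navigation looks like, here is a small sketch. It assumes the `beautifulsoup4` package is installed, and the markup, class names, and the `extract_fields` helper are made up for illustration; each real site needs its own selectors.

```python
from bs4 import BeautifulSoup

# Illustrative markup; real recipe pages differ from site to site.
SAMPLE_HTML = """
<html><body>
  <h1 class="recipe-title">Pancakes</h1>
  <ul class="ingredients">
    <li>2 eggs</li>
    <li>200 g flour</li>
  </ul>
</body></html>
"""


def extract_fields(html):
    """Pull a title and an ingredient list out of the assumed markup."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching tag, or None when nothing matches.
    title_tag = soup.find("h1", class_="recipe-title")
    title = title_tag.get_text(strip=True) if title_tag else None
    # select() takes a CSS selector and returns every matching tag.
    ingredients = [li.get_text(strip=True) for li in soup.select("ul.ingredients li")]
    return {"title": title, "ingredients": ingredients}


if __name__ == "__main__":
    print(extract_fields(SAMPLE_HTML))
```

Selectors like these depend on one site's consistent markup, which is why they rarely carry over to HTML produced by a generic PDF converter.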
If you'd like to learn more, I'd recommend inspecting some of the source code - here's the current code for the WikiCookbook scraper, for example: https://github.com/hhursev/recipe-scrapers/blob/9f4317ca0b7fd741b5d1723b446b961e05e596cf/recipe_scrapers/wikicookbook.py
The issue is that I converted cookbooks from PDF to HTML and read them using
And I get the following
Pre-filing checks
The URL of the recipe(s) that are not being scraped correctly
None
...
The results you expect to see
Recipes
The results (including any Python error messages) that you are seeing
    scraper = scrape_html(html=html)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Logs\Anaconda\envs\p311\Lib\site-packages\recipe_scrapers\__init__.py", line 695, in scrape_html
    raise NoSchemaFoundInWildMode(org_url)
recipe_scrapers._exceptions.NoSchemaFoundInWildMode: recipe-scrapers exception: No Recipe Schema found at None.