hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

NoSchemaFoundInWildMode: recipe-scrapers exception: No Recipe Schema found at None. #1007

Closed Bardo-Konrad closed 4 months ago

Bardo-Konrad commented 4 months ago

The issue is that I converted cookbooks from PDF to HTML and read them using

```python
from recipe_scrapers import scrape_html

ort = "cookbook.html"
with open(ort, "r", encoding="UTF-8") as f:
    html = f.read()

scraper = scrape_html(html=html)
print(scraper)
```

And I get the following

Pre-filing checks

The URL of the recipe(s) that are not being scraped correctly

None

...

The results you expect to see

Recipes

The results (including any Python error messages) that you are seeing

```
    scraper = scrape_html(html=html)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Logs\Anaconda\envs\p311\Lib\site-packages\recipe_scrapers\__init__.py", line 695, in scrape_html
    raise NoSchemaFoundInWildMode(org_url)
recipe_scrapers._exceptions.NoSchemaFoundInWildMode: recipe-scrapers exception: No Recipe Schema found at None.
```

jayaddison commented 4 months ago

Sorry; this library can only scrape a single recipe at a time from the HTML retrieved from public recipe websites.

Bardo-Konrad commented 4 months ago

So if I separated the recipes it could work?

jayaddison commented 4 months ago

It's unlikely to work with generic PDF-to-HTML conversion software, but in theory it is possible, if the conversion software produces HTML that contains valid schema.org recipe metadata.
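For illustration, here is a minimal sketch of what such schema.org recipe metadata looks like when embedded as JSON-LD. The recipe values are invented placeholders, not from this issue; a PDF-to-HTML converter would need to emit something like this `<script type="application/ld+json">` block for a schema to be detectable at all.

```python
import json

# Hypothetical schema.org Recipe metadata in JSON-LD form (illustrative values).
recipe_jsonld = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Example Soup",
    "recipeIngredient": ["1 onion", "2 carrots"],
    "recipeInstructions": [{"@type": "HowToStep", "text": "Chop and simmer."}],
}

# Embed the metadata in a page the way recipe websites typically do.
html = (
    '<html><head><script type="application/ld+json">'
    + json.dumps(recipe_jsonld)
    + "</script></head><body>Recipe text here.</body></html>"
)
print(html)
```

Generic converters produce only visual markup (paragraphs, font spans), which is why wild-mode scraping finds no schema in their output.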

Bardo-Konrad commented 4 months ago

> It's unlikely to work with generic PDF-to-HTML conversion software, but in theory it is possible, if the conversion software produces HTML that contains valid schema.org recipe metadata.

That is certainly not the case. Isn't there a regex-based part of the code that detects and extracts recipes robustly?

jayaddison commented 4 months ago

There's no generic regex for that, no. The typical workflow is that the domain name from the recipe's URL is matched against our available scrapers, and if one is found, it implements logic for each of the different fields (title, ingredients, prep_time, ...). Fortunately, many of them can look that data up directly in the schema.org metadata; and if not, they can navigate the HTML to find the relevant information.
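The matching step described above can be sketched roughly as follows. Note that `SCRAPERS` and `match_scraper` are hypothetical stand-ins for this explanation; the library's real registry and class names differ.

```python
from urllib.parse import urlparse

# Hypothetical host-to-scraper registry, standing in for the library's
# real mapping of domain names to scraper classes.
SCRAPERS = {
    "en.wikibooks.org": "WikiCookbook",
}

def match_scraper(url):
    # Extract the hostname from the recipe URL and look up a scraper for it.
    host = urlparse(url).hostname
    return SCRAPERS.get(host)

print(match_scraper("https://en.wikibooks.org/wiki/Cookbook:Pancake"))  # WikiCookbook
print(match_scraper("https://example.com/recipe"))  # None: no scraper for this domain
```

A locally converted file has no URL at all (`org_url` is `None`), so this lookup can never succeed, which is why the library fell back to wild mode and then raised `NoSchemaFoundInWildMode`.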

Bardo-Konrad commented 4 months ago

> and if not, they can navigate the HTML to find the relevant information.

How?

jayaddison commented 4 months ago

By using the BeautifulSoup Python library, in most cases.

If you'd like to learn more, I'd recommend inspecting some of the source code - here's the current code for the WikiCookbook scraper, for example: https://github.com/hhursev/recipe-scrapers/blob/9f4317ca0b7fd741b5d1723b446b961e05e596cf/recipe_scrapers/wikicookbook.py
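To make the HTML-navigation idea concrete, here is a minimal BeautifulSoup sketch. The markup, class names, and field choices are invented for illustration and do not come from any particular scraper in the library.

```python
from bs4 import BeautifulSoup

# Invented markup resembling what a recipe site without schema.org
# metadata might serve (illustrative only).
html = """<html><body>
<h1 class="recipe-title">Example Soup</h1>
<ul class="ingredients"><li>1 onion</li><li>2 carrots</li></ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the document tree by tag and class to pull out each field.
title = soup.find("h1", class_="recipe-title").get_text(strip=True)
ingredients = [
    li.get_text(strip=True)
    for li in soup.find("ul", class_="ingredients").find_all("li")
]

print(title)        # Example Soup
print(ingredients)  # ['1 onion', '2 carrots']
```

Each site-specific scraper in the library encodes this kind of selector logic for one domain's markup, which is why a generic fallback for arbitrary HTML is not feasible.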