Grouped instructions, APIs, and open schema (Gousto in v5)

hhursev / recipe-scrapers

Python package for scraping recipes data

MIT License


Is openschema intended to be a universal input? Otherwise, is this a downstream issue and should I be trying to see if Tandoor would accept recipes in JSON (avoiding the use of recipe-scrapers) instead of HTML?

Hi @tboby - thanks for the discussion.

You're probably aware of some/all of this, but as a recap/background:

We have three independent projects in this case, each with their own slightly different perspectives.

schema.org - a standard (including the Recipe type) that I think you're referring to as openschema (I haven't seen that name myself, but maybe it is called that elsewhere). It was put forward by a few companies (Google, Microsoft, Yahoo, Yandex) as a simple format that websites can easily adopt, and it makes it easier for those companies and others to retrieve metadata of various kinds. It allows data on the web to be more structured and accessible - generally good things.
Tandoor - an open source recipe manager and meal planner application. I'm not hugely familiar with it myself, but it's one of the downstream consumers of recipe-scrapers and seems to have a large-ish known user base, so I try to keep its maintainers informed of updates like the v15 upgrade.
recipe-scrapers - this Python library; a way to retrieve recipe metadata from webpages. It was initially HTML-only, then expanded considerably with support for schema.org and other metadata formats, and until v15 it supported multi-request scraping (request an HTML page, then from there request a JSON API).

That's important background to answer your question:

Is openschema intended to be a universal input? Otherwise, is this a downstream issue and should I be trying to see if Tandoor would accept recipes in JSON (avoiding the use of recipe-scrapers) instead of HTML?

I can't really provide a useful answer about the ambitions of schema.org, but it certainly meets the requirements for recipe-scrapers, and that simplifies our code a lot. In some ways it'd be nice if we also had parallel HTML retrieval for the recipe websites we support, because then we could verify that what the schema.org metadata claims matches what appears to users, but that's not something we support today.
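To illustrate the kind of cross-check I mean, here is a rough sketch using only the Python standard library. It is not something recipe-scrapers does, and the names `VisibleTextParser` and `missing_from_page` are made up for this example. It assumes the claimed ingredients have already been read from the page's schema.org metadata, and it only does a naive substring comparison against the visible text.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text a reader would see, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Metadata inside <script> blocks is not shown to users, so ignore it.
        if not self._skip_depth and data.strip():
            self.chunks.append(data)


def _normalize(text):
    """Lowercase and collapse whitespace so minor layout differences don't matter."""
    return " ".join(text.lower().split())


def missing_from_page(html, claimed_ingredients):
    """Return the claimed ingredients that do not appear in the visible page text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    visible = _normalize(" ".join(parser.chunks))
    return [item for item in claimed_ingredients if _normalize(item) not in visible]
```

A real check would need to be much more forgiving (units, pluralization, ingredients split across several elements), so treat this as an outline of the idea only.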

I think it's certainly valid to ask Tandoor whether they would consider retrieving from schema.org JSON directly. If I were them, then on the one hand I would be quite comfortable offloading the support/accuracy checks about recipes to a separate component (e.g. recipe-scrapers). On the other hand, there is certainly a performance and dependency management angle to consider too (fewer dependencies is good). And, maybe at a lower priority, there could be a vendor lock-in concern: schema.org is good, but the ability to support sites that don't use it, for whatever reason, is also beneficial.
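For a sense of what retrieving from schema.org JSON directly could involve, here is a minimal sketch, assuming the site embeds its Recipe as JSON-LD in a `<script type="application/ld+json">` block. It uses only the Python standard library; `JsonLdParser` and `find_recipe` are hypothetical names for this example and are not part of recipe-scrapers or Tandoor.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every JSON-LD script block in a page."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def _is_recipe(node):
    """True if a JSON-LD node declares the Recipe type (string or list form)."""
    node_type = node.get("@type")
    types = node_type if isinstance(node_type, list) else [node_type]
    return "Recipe" in types


def find_recipe(html):
    """Return the first schema.org Recipe node found in the page, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for doc in parser.documents:
        # A block may hold one object or a list of objects.
        for node in doc if isinstance(doc, list) else [doc]:
            if not isinstance(node, dict):
                continue
            # Some sites nest their items under a top-level "@graph" list.
            for item in [node, *(node.get("@graph") or [])]:
                if isinstance(item, dict) and _is_recipe(item):
                    return item
    return None
```

Even with something like this, the consumer still has to normalize the fields itself. For example, `recipeInstructions` is commonly a plain string, a list of `HowToStep` items, or steps grouped into `HowToSection` items. Sites that expose no schema.org metadata at all would return `None` here, which is where the lock-in concern comes in.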


Grouped instructions, APIs, and open schema (Gousto in v5) #1225