Open mkierzenka opened 7 months ago
Yes, I think you're figuring this out correctly @mkierzenka - thank you for investigating.
As documented, the schema.org
fill plugin is intended to allow recipe fields that are not implemented -- the default for many of them in the AbstractScraper
class -- to be populated when corresponding schema.org
Recipe
metadata is found in the page HTML. It does this by re-routing unimplemented method calls from the scraper to the method of the same name in the SchemaOrg
class.
However: the plugin declares a set of fields to decide when to do that re-routing, and that can become out-of-sync with the schema.org
fields that we support in SchemaOrg
.
I think we should be cautious about enabling automatic rerouting of all those fields - I can't immediately think of what could go wrong, but any change like this can have unanticipated side-effects.
cc @jknndy who has been considering a few schema-related fields recently and may have some more thoughts/ideas.
@mkierzenka the run_on_methods
collection has been updated to include a few more fields, particularly by #1065 - that's not released yet, but could be relevant for you?
@jayaddison Yup, that one adds 4 of the fields I mentioned- thanks for pointing it out! In regards to my 1st bullet point, don't see a "reviews" nor "links" field on the schema.org site (there is "review" which claims to supersede "reviews", though, if that is relevant)
But understand if there is hesitancy to change this due to potential side-effects
Ok, yep - links
is not strictly a schema.org
field -- and we don't include an implementation for it in our SchemaOrg
class. However, we do implement it at the top-level AbstractScraper
. That means the naming of SchemaOrgFillPlugin
could be confusing in the context of links
.
Because the reviews
field doesn't have an implementation at all yet, we should either remove that, or implement it. I'd suggest that we keep it plural; our interface doesn't have to match the naming of the upstream schema.org
spec (indeed we diverge in a few places already).
Hi all, I noticed a couple things about the SchemaOrg filler plugin (
recipe_scrapers/plugins/schemaorg_fill.py
) when I was experimenting with the code for my recent HalfBakedHarvest scraper ticket.Example code:
Through trial and error I saw:
schm
does not have these two methods:reviews
,links
(there is some related discussion linked in #870)schm
has more methods than those it is currently allowed to override, i.e. can extend therun_on_methods
tuple withcook_time
,cuisine
,description
,format
,people
,prep_time
, andratingsdata
. Or a subset of these.Not having worked with SchemaOrg before, I am not quite sure if the available methods of the schema object will change based on the particular recipe/scraper used... but would expect no because it seems to be a standardized interface? Please let me know if I am wrong about that. And if that's the case, perhaps
SchemaOrgFillPlugin.run_on_methods
should not have the two missing methods and include the others that are available above?Please let me know if I am thinking about this the right way / what you think. Happy to make the changes myself if they are deemed correct.