hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

Update SchemaOrg Fill Plugin? #1101

Open mkierzenka opened 2 months ago

mkierzenka commented 2 months ago

Hi all, I noticed a couple things about the SchemaOrg filler plugin (recipe_scrapers/plugins/schemaorg_fill.py) when I was experimenting with the code for my recent HalfBakedHarvest scraper ticket.

Example code:

hbh = HalfBakedHarvest("https://www.halfbakedharvest.com/banana-bread-mug-cake/")
schm = hbh.schema
schm.reviews()
schm.prep_time()

Through trial and error I saw:

Not having worked with SchemaOrg before, I am not quite sure if the available methods of the schema object will change based on the particular recipe/scraper used... but would expect no because it seems to be a standardized interface? Please let me know if I am wrong about that. And if that's the case, perhaps SchemaOrgFillPlugin.run_on_methods should not have the two missing methods and include the others that are available above?

Please let me know if I am thinking about this the right way / what you think. Happy to make the changes myself if they are deemed correct.

jayaddison commented 2 months ago

Yes, I think you're figuring this out correctly @mkierzenka - thank you for investigating.

As documented, the schema.org fill plugin is intended to allow recipe fields that are not implemented -- the default for many of them in the AbstractScraper class -- to be populated when corresponding schema.org Recipe metadata is found in the page HTML. It does this by re-routing unimplemented method calls from the scraper to the method of the same name in the SchemaOrg class.

However: the plugin declares a set of fields to decide when to do that re-routing, and that can become out-of-sync with the schema.org fields that we support in SchemaOrg.

I think we should be cautious about enabling automatic rerouting of all those fields - I can't immediately think of what could go wrong, but any change like this can have unanticipated side-effects.

cc @jknndy who has been considering a few schema-related fields recently and may have some more thoughts/ideas.

jayaddison commented 2 months ago

@mkierzenka the run_on_methods collection has been updated to include a few more fields, particularly by #1065 - that's not released yet, but could be relevant for you?

mkierzenka commented 1 month ago

@jayaddison Yup, that one adds 4 of the fields I mentioned- thanks for pointing it out! In regards to my 1st bullet point, don't see a "reviews" nor "links" field on the schema.org site (there is "review" which claims to supersede "reviews", though, if that is relevant)

But understand if there is hesitancy to change this due to potential side-effects

jayaddison commented 1 month ago

Ok, yep - links is not strictly a schema.org field -- and we don't include an implementation for it in our SchemaOrg class. However, we do implement it at the top-level AbstractScraper. That means the naming of SchemaOrgFillPlugin could be confusing in the context of links.

Because the reviews field doesn't have an implementation at all yet, we should either remove that, or implement it. I'd suggest that we keep it plural; our interface doesn't have to match the naming of the upstream schema.org spec (indeed we diverge in a few places already).