hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Ideas for developer experience improvements #617

Closed: jayaddison closed this issue 18 hours ago

jayaddison commented 1 year ago

There are a few places where we might be able to spruce up and improve the developer experience at the moment. Faster, more consistent, fewer configuration points, more straightforward workflows - that kind of stuff.

Some examples from recent experience:

jayaddison commented 1 year ago

Adding one more item: debugging plugin-related issues is tricky at the moment, and I think the reason is that it's not clear/transparent what plugins are enabled for each method call.

They're important because they catch and handle various forms of noisy web-based input for various method calls (title, ingredients, instructions in particular), but they're challenging to reason about and may produce unexpected behaviour at runtime.

Labelling every method with multiple decorators would look spammy, so I don't think that'd be a great approach. We should also bear in mind that plugins are developer-customizable -- callers may wish to opt-in or opt-out of various post-processing steps. And also we shouldn't introduce unnecessary overhead -- but nor should we allow bad/unsafe data on the basis of technical minimalism.

Taking HTML unescaping as an example: at the moment I think we HTML unescape twice for fields that we expect may contain HTML content. As far as I can tell, there's one particular site which incorrectly double-escapes HTML entities in their schema.org content, and that's why we do this. Perhaps the default should be to HTML unescape once, but allow decoration with html_unescape(rounds=2) or similar to override that per-site. And if a performance-sensitive caller felt very confident that a particular scraper required no unescaping, they could somehow override method calls to state html_unescape(rounds=0).
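
To make that concrete, here's a rough sketch of what a per-site html_unescape(rounds=N) decorator could look like. The decorator, its default, and the example scraper are hypothetical, not existing library code.

import html
from functools import wraps


def html_unescape(rounds=1):
    # Unescape HTML entities in a method's return value a fixed number of times.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            value = func(*args, **kwargs)
            for _ in range(rounds):
                value = html.unescape(value)
            return value
        return wrapper
    return decorator


class DoubleEscapedSiteScraper:
    # Hypothetical site that double-escapes entities in its schema.org content.
    @html_unescape(rounds=2)
    def title(self):
        return "Mac &amp;amp; Cheese"  # unescapes to "Mac & Cheese"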

Basically I think that the default implementation should be "safe and correct" but with the opportunity to skip processing steps for power-users (by which I mean sophisticated power-data-users really -- sites with enough data to know where and when it makes sense to omit a step for performance reasons).

What this implies, I think, is a pipeline for each individual method. Perhaps taken to logical extremes it means that there is no "recipe scrapers", but instead a "recipe title scraper", "recipe instructions scraper", ... - and each is a defined pipeline for a {website, timerange, conditions} set. With the goal being to achieve near-100% (and accurate) coverage of known web content across all websites and timeranges.

I don't think we should attempt that yet - Python classes that wrap entire sites make much more sense in terms of user convenience and developer attention (context-switching). But basically each method here is codifying "here's where we think that the remote data provider has put this particular item of information for a given URL", followed by "and here are the steps we wish to apply to convert that into something that we believe is safe and accurate for our users".
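
Purely as a thought experiment (none of this exists in the library), a per-field pipeline might look something like this: a "locate" step codifying where a site puts the data, followed by the conversion steps that make it safe and accurate.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FieldPipeline:
    # Where we believe the remote data provider has put this piece of information.
    locate: Callable[[str], str]
    # Steps applied to convert that into something safe and accurate for users.
    convert: List[Callable[[str], str]] = field(default_factory=list)

    def run(self, html: str) -> str:
        value = self.locate(html)
        for step in self.convert:
            value = step(value)
        return value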

If we become good enough at that, we can "upstream" any errors and oddities back to the original source and basically create a positive feedback loop that reduces the number of exceptions required (while still ideally detecting them when they occur).

jayaddison commented 1 year ago

A few more notes for a developer guide:

And another idea: should we have a similar code review / maintainer guide?

bfcarpio commented 1 year ago

I'm going to add some random, only slightly thought-out things:


@jayaddison What's the scraper that incorrectly double-escapes HTML? That might be a good individual issue.

jayaddison commented 1 year ago

Remove schema based scrapers that have no overloads

I mostly agree with this, with a few small notes:

Networking code moved to scrape_me

Thinking about scraping as a pipeline could help here. There's content retrieval (HTML/WARC/etc), then parsing, then extraction of metadata, and then mapping the metadata to output format(s) (python method calls, JSON, ...).

Scraper developers should only really have to care about the extraction part (and only when schema / microdata / foo doesn't already extract what they need). I'm not sure how we most easily represent that in Python classes. At the moment the retrieval and extraction steps are slightly bundled together.
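
A very rough illustration of that separation (none of these function names exist in the library; the point is that scraper authors would only need to touch the extract step):

import requests
from bs4 import BeautifulSoup


def retrieve(url: str) -> str:
    # Content retrieval: HTML over HTTP here, but could equally be WARC, a cache, etc.
    return requests.get(url, timeout=10).text

def parse(html: str) -> BeautifulSoup:
    # Parsing the raw content into a queryable document.
    return BeautifulSoup(html, "html.parser")

def extract(soup: BeautifulSoup) -> dict:
    # Extraction: the only part a scraper author should usually care about.
    return {"title": soup.find("title").get_text()}

def to_output(metadata: dict) -> dict:
    # Mapping the extracted metadata to the caller-facing output format(s).
    return metadata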

Replace requests with urllib

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Use poetry

I'm 50/50 on that. I agree the version management and tracking could be useful. At the moment (after some lxml support issues in #629) I'm more in the mindset of removing dependencies where possible (maybe I should focus on removing the ones that I added first...)

Switch to pyproject.toml

Agreed, makes sense. Note: we do have a MANIFEST.in file that I think is still used, let's try not to forget about that during migration.

jayaddison commented 1 year ago

What's the scraper that incorrectly double-escapes HTML?

Unfortunately I'm not sure I kept a note of that. From a quick search, I think a few sites are affected (seriouseats, homechef, food52, ...). It's typically in recipe-related element contents, not the entire page (a developer adding an extra htmlencode(...) for safety, for example).

bfcarpio commented 1 year ago

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, add comments saying to first test with wild mode, or do you mean actually including a trail of code that would help someone parse HTML better?

We do still need a hostname-to-scraper mapping somewhere

What for? If a scraper can entirely be handled via wild_mode then wouldn't the client use AbstractScraper? I've forgotten if that's the correct type that's returned.

At the moment the retrieval and extraction steps are slightly bundled together.

I would think of scrape_me as the default, easy-to-use, batteries-included retrieval implementation. If someone wants to use async code, or have their input come from rabbitmq, then they can write their own handler and pass it to a dedicated scraper class. I'm assuming here that the data is the entire HTML content of the page, so extraction can proceed as normal.
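
For example (a sketch only: ExampleScraper and its constructor are hypothetical, and aiohttp stands in for whatever transport a caller prefers):

import asyncio

import aiohttp
from bs4 import BeautifulSoup


class ExampleScraper:
    # Hypothetical scraper that accepts pre-fetched HTML instead of fetching a URL itself.
    def __init__(self, html: str, url: str):
        self.soup = BeautifulSoup(html, "html.parser")
        self.url = url

    def title(self) -> str:
        return self.soup.find("title").get_text()


async def main():
    url = "https://example.com/some-recipe"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            html = await response.text()
    print(ExampleScraper(html, url).title())


asyncio.run(main())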

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Oh, then never mind. It's not that big of a deal.

jayaddison commented 1 year ago

Remove schema based scrapers that have no overloads

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, add comments saying to first test with wild mode, or do you mean actually including a trail of code that would help someone parse HTML better?

A hypothetical scenario: let's say that a schema-supported scraper becomes nothing more than a class with a host method.

In that case, I don't think we want new pull requests with template-generated scrapers that have a complete set of self.schema.* method calls.

(at the risk of overcomplicating things: perhaps ideally we'd want the generator to attempt to call each schema method, and only include methods for fields that were missing entirely from the input HTML schema and microdata)
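
In other words, the hypothetical end state for a purely schema-supported site might be something like this (the domain is made up, and this assumes the existing AbstractScraper base class keeps providing the schema-backed defaults):

from recipe_scrapers._abstract import AbstractScraper


class SomeSchemaOnlySite(AbstractScraper):
    @classmethod
    def host(cls):
        return "somesite.example"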

We do still need a hostname-to-scraper mapping somewhere

What for?

A couple of reasons I can think of - potentially resolvable:

jayaddison commented 1 year ago

And after a bit more thought, two more reasons for keeping a hostname-to-scraper mapping:

(those are both fairly important use-cases, I think)

hhursev commented 1 year ago

How about we do the following in '22:


In a separate issue:

jayaddison commented 1 year ago

Sounds ambitious, and good :)

About the plugins: before removing them, would it make sense to move them onto the AbstractScraper as a default list? People could still override them as necessary per-scraper, or in third-party libraries by subclassing, and I think it'd result in clearer code in this library (as in, fewer lines to read to understand what they're doing).
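
A loose sketch of that shape (the plugin class here is a placeholder, not one of the library's real plugins):

class HTMLStripPlugin:
    # Placeholder plugin: real plugins wrap scraper methods to clean up their output.
    @staticmethod
    def run(value: str) -> str:
        return value.replace("<br>", "\n")


class AbstractScraper:
    # Default post-processing, visible in one place rather than spread across decorators.
    plugins = [HTMLStripPlugin]


class QuietScraper(AbstractScraper):
    # Per-scraper (or third-party subclass) override: opt out entirely.
    plugins = []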

About settings: we only have a small number of them, right?

hhursev commented 1 year ago

Seems like settings module should be removed 😄

Implementing plugins as a default list in AbstractScraper sounds nice to begin with - yep. Let's go down this path.

jayaddison commented 1 year ago

I'd like to introduce #650 (a migration to use tox), although I'm not totally sure what everyone's opinions might be about it. Feedback appreciated!

bfcarpio commented 1 year ago

I'd like to take a stab at the networking code. I've got a couple of ideas ranging from lazily adding a callback to writing some terrible XFactoryFactory OOP code.

jayaddison commented 1 year ago

in case we end up with ~5 configs total, remove settings module altogether. Switch to environment flags

From attempting some modifications, I'm going to suggest a small change to this part of the plan:

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

The benefit there is that we'll have a single place where settings are defined and read -- avoiding the multiple-config-files situation, and also avoiding multiple os imports appearing around the codebase.
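
Something like this, roughly (the setting names are only examples):

# recipe_scrapers/settings.py (sketch): the single place settings are defined,
# each one read once from the environment.
import os

SUPPRESS_EXCEPTIONS = os.environ.get("RECIPE_SCRAPERS_SUPPRESS_EXCEPTIONS", "0") == "1"
META_HTTP_EQUIV = os.environ.get("RECIPE_SCRAPERS_META_HTTP_EQUIV", "0") == "1"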

jayaddison commented 1 year ago

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

class Template(AbstractScraper):

    # The self.soup object provides a way to write queries for contents on the page
    # For example: the following retrieves the contents of a 'title' HTML tag
    def title(self):
        return self.soup.find("title").get_text()

    # When a scraper field is accessed, the code in your method always runs first - it's a precise set of instructions provided by you.
    # You can also add generic extractors -- like 'schemaorg' here -- that are used if your code doesn't return a value during scraping.
    @field.schemaorg
    def ingredients(self):
        return self.soup.find(...)

    # Example: since this method is empty ('pass'), the opengraph metadata extractor will always be called
    # If opengraph extraction fails, then schemaorg extraction is called next
    @field.schemaorg
    @field.opengraph
    def image(self):
        pass

jayaddison commented 1 year ago

A pyproject.toml migration is ready for review in #655. After comparing the poetry and setuptools approaches, and learning various backwards compatibility and workflow questions related to them, I think that setuptools is a simpler route for us to take at the moment.

hhursev commented 1 year ago

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

I'd vote to make it just a settings.py/config.py file then, and remove the RecipeScraperSettings class. The idea behind that class was to let users write more complex settings (dict/list structures and so on, through a file) and to have settings changes take immediate effect without needing to restart the program/script using recipe_scrapers. We don't need that. FWIW I'd pick config.py for the naming.

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

I don't see it as an improvement. In my opinion:

class Template(AbstractScraper):
    # .....

    def ingredients(self):
        return self.schema.ingredients()

    def image(self):
        return super().opengraph_image()

is good enough. If the schema/opengraph image is there, it will work. If not, devs will adjust. I see no real downsides.

After comparing the poetry and setuptools approaches

Let's not switch to poetry for now (ever?). I personally don't see much/any gain in our case. Open for discussion (I use poetry in another project I'm part of and see no real gain in using it here).

jayaddison commented 1 year ago

I'm going to take some time out from maintenance/dev work here for a week or two. I feel like I've been commenting a lot and thinking/attempting a lot of stuff, but it's felt kinda disjointed or rushed at times.

Moving some of the larger/breaking changes here into the v15 branch makes sense I think - makes it easier to develop all that without worrying so much about maintaining compatibility. Hopefully in a few weeks I'll feel a bit refreshed to code up some more stuff.

hhursev commented 1 year ago

Hey @jayaddison take as much time as you need. v15 won't be merged in without your approval :wink:

You've done a great job in the last month! Thank you very much. The pyproject.toml introduction, speeding up the CI :star:, the schema.org parser updates, merging incoming PRs quickly, and addressing so many of the points in this issue. Prototyping different stuff and being prompt with communication on top of all that.

I really respect the work and time you spend on this project! Kudos to eager people like you supporting OSS.

jayaddison commented 1 year ago

remove the "plugins" idea altogether

Maybe there'd be an alternative implementation, but after getting back into the code again recently - I do think that the plugins are useful. #705 looks like a quick fix made possible by the SchemaOrg plugin, for example.

jayhale commented 1 year ago

Typing: Considering most developers are using IDEs that make use of type-hinting, it would be helpful if the base API were implemented with type signatures throughout.
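
A rough sketch of what that could look like on the base API (the return types here are guesses at intent, not the library's current annotations):

from typing import List, Optional


class AbstractScraper:
    def title(self) -> str: ...

    def total_time(self) -> int: ...  # minutes

    def ingredients(self) -> List[str]: ...

    def image(self) -> Optional[str]: ...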

jayaddison commented 10 months ago

Maybe a controversial opinion: I think that the v15 branch should include a breaking change where instructions() returns type list[str] (instead of a newline-joined str), mirroring ingredients().

Making those two methods more equivalent should, I think, make the developer experience easier, and also I think it's a slightly better caller/user experience.

I won't lie: sometimes it's difficult to extract multi-item instructions from websites. But from experience in #852 recently, I think it's possible. In rare cases where it's not, or is too difficult, we can return a list containing a single item.
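
Roughly the difference in caller terms (illustrative values only):

# Today: a single newline-joined string.
instructions_str = "Preheat the oven.\nMix the dry ingredients.\nBake for 30 minutes."

# Proposed for v15: a list of steps, mirroring ingredients().
instructions_list = instructions_str.split("\n")
# ["Preheat the oven.", "Mix the dry ingredients.", "Bake for 30 minutes."]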

jayaddison commented 9 months ago

I've pushed commit 0b87a5141a0d5497566425110d6d7947ad1bc76f to the v15 branch -- it removes the scrape_me interface, meaning that the library doesn't have any network dependencies.

Is that good/bad? Is it a developer experience improvement? I'm not really sure. It's a bit annoying to have to retrieve some HTML from a URL, and then pass both the HTML and the URL for scraping. Not terrible, but compared to one import and one method call, it's kinda involved.
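
For comparison, the v15-style flow looks roughly like this (the URL is illustrative; scrape_html is the entry point discussed later in this thread):

import requests
from recipe_scrapers import scrape_html

url = "https://example.com/some-recipe"
html = requests.get(url, timeout=10).text

scraper = scrape_html(html, org_url=url)
print(scraper.title())
print(scraper.ingredients())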

jayaddison commented 7 months ago

Move away "networking" code and allow easy usage of the package with async/proxies/timeout (we should leave an example code too). no demo with javascript/browser check bypass. Just making a [new entity] that does the HTTP requests needed and gets the HTML. -> pass that down to our scrapers.

I've published a release candidate, v15.0.0-rc1 that includes this change. I'll begin using that for RecipeRadar soon so that it gets some production testing.

jayaddison commented 7 months ago

Ah, no I haven't - v15.0.0-rc1 is completely network-isolated, but the suggestion there was to include a utility method to do the HTTP retrieval. I'll do that for rc2 (it'll re-add requests as an optional dependency, and I think that's fine).

jayaddison commented 7 months ago

Also: the scrape_html method can probably be simplified further in v15 to remove some of the proxy/timeout logic. That can move into the suggested utility method.

jayaddison commented 4 months ago

* Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.

@bfcarpio this might be super straightforward now that we use data-driven tests (#944). Would you like to take a look at that? I think a large amount of code could be cleaned up.

bfcarpio commented 4 months ago

I'll admit, I haven't worked on this repo in a long time. I'll be quite busy the next couple months as well. Funny how life challenges your priorities.

Yes, I would like to have this work done and reduce the code, but I can't commit to anything.

jayaddison commented 18 hours ago

I think that the tasks from here are complete enough that this can be closed; the mypy type-hinting inference for subclass methods would be very nice, but may take a while to arrive and can be handled in the future as a separate task. Changes to normalize_string likely require a more detailed specification and/or specific bug reports that they address.

Thank you all for your input :)

On a tangential note: I plan to release a final (non-release-candidate) of v15 within the next few days.