jamesturk / spatula

A modern Python library for writing maintainable web scrapers.
https://jamesturk.github.io/spatula/
MIT License

Improve (& document) scraper testing workflow #37

Open jamesturk opened 1 year ago

jamesturk commented 1 year ago

I want to think through this a bit & welcome feedback from anyone that'd like better ways to test their scrapers written using spatula.

The problem this attempts to solve: when writing scrapers, you might want to test against a cached copy of a page, and you would also want an easy way to update that cached copy. This feels like it falls well within spatula's domain, and spatula could offer a solution that works for common cases.

I've considered a few approaches and am currently leaning towards the following:

Idea: Provide helper to turn page into a TestablePage

Sources are responsible for fetching themselves in Source.get_response. By replacing a page's sources with special caching versions, an existing Page can be tested against a cached response.

def test_example_page():
    # this would replace all of a page's sources with a new TestCacheURL,
    # other parameters would stay the same
    page = make_testable_page(ExamplePage(...))
    assert page.process_page() == [1, 2, 3]
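
To make the source-swapping idea concrete, here is a rough sketch of what a make_testable_page helper might look like. It is only an illustration, not spatula's API: CachedSource is a hypothetical wrapper, the cache is a plain dict keyed by str(source), and the get_response(scraper) signature is assumed. It handles a single source attribute and does not touch pages yielded by the page under test.

```python
import copy


class CachedSource:
    """Hypothetical wrapper: serves a stored response instead of fetching."""

    def __init__(self, wrapped, cache):
        self.wrapped = wrapped
        self.cache = cache  # assumed: dict-like mapping of str(source) -> response

    def get_response(self, scraper=None):
        key = str(self.wrapped)
        if key not in self.cache:
            # never fall through to the network in this simplified version
            raise LookupError(f"no cached response for {key}")
        return self.cache[key]


def make_testable_page(page, cache):
    """Return a shallow copy of `page` whose source reads from `cache`."""
    testable = copy.copy(page)  # leave the original page untouched
    testable.source = CachedSource(page.source, cache)
    return testable
```

Because the copy is shallow, every other attribute on the page stays the same, which matches the "other parameters would stay the same" note above.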

TestCacheURL would do the following:

check a configurable location (spatula_testdata.sqlite3) for a cached copy of the response; if present, return it as-is
if a URL isn't present in the cache, this would be an error unless a special environment variable (SPATULA_TEST_UPDATE_SOURCES) is set
to make this easier to use, the CLI interface could add methods to check the status of the cache/clear entries/etc.
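
The lookup behavior described above could be sketched roughly like this. Everything here is an assumption made for illustration: the ResponseCache and CacheMiss names, the table layout, and the fetch callable that stands in for whatever would perform the real request. Only the file name and the environment variable come from the proposal.

```python
import os
import sqlite3


class CacheMiss(Exception):
    """Raised when a URL is not cached and updating is not enabled."""


class ResponseCache:
    """Hypothetical SQLite-backed store mapping URL -> response body."""

    def __init__(self, path="spatula_testdata.sqlite3"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses (url TEXT PRIMARY KEY, body BLOB)"
        )

    def get(self, url, fetch):
        row = self.db.execute(
            "SELECT body FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return row[0]  # cached copy is returned as-is
        if not os.environ.get("SPATULA_TEST_UPDATE_SOURCES"):
            raise CacheMiss(url)  # a miss is an error unless updating is enabled
        body = fetch(url)  # `fetch` is an assumed callable: url -> bytes
        self.db.execute(
            "INSERT OR REPLACE INTO responses (url, body) VALUES (?, ?)",
            (url, body),
        )
        self.db.commit()
        return body
```

Under these assumptions, running the test suite once with SPATULA_TEST_UPDATE_SOURCES=1 would populate the database, and later runs without the variable would fail loudly on any URL that was never recorded.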

This would be pretty simple for 80% of cases. It might get complicated for pages that yield other pages, since presumably you'd want their sources replaced too.

I'd also considered just having a global flag that alters how URL sources work (SPATULA_TEST_MODE), but I'm not sure I like that approach yet.

jefftriplett commented 1 year ago

It'd be nice to be able to pass a string (response or byte-string or whatever you think is best) into our Page class via ExamplePage(response="...") where "..." is the contents of what we'd like it to parse. If we need to wrap the string with a Response object or something that's fine too. Then we can test the scraper against it.
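
One way this could look, sketched without touching spatula itself: a small source-like object that hands back fixed content. StringSource and FakeResponse are hypothetical names, and the attributes on the fake response (content, text, status_code, url) are guesses at what a page might read, not a statement of what spatula requires.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing attributes a page might read."""

    content: bytes
    status_code: int = 200
    url: str = "https://example.com/"

    @property
    def text(self) -> str:
        return self.content.decode("utf-8")


class StringSource:
    """Hypothetical source that serves fixed content instead of fetching."""

    def __init__(self, content):
        # accept either str or bytes, store bytes
        self.content = content.encode("utf-8") if isinstance(content, str) else content

    def get_response(self, scraper=None):
        return FakeResponse(self.content)
```

A test could then pass something like StringSource("<html>...</html>") wherever the page expects its source, assuming the page only needs the attributes shown here.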

I'm happy to expand this more if it'd be helpful.

jefftriplett commented 1 year ago

This looks handy: https://github.com/jamesturk/spatula/blob/2bf8f378c8a83d36fb50b362a5895181794ee733/tests/test_pages.py#L18-L37

I'll try again to test a few of my scrapers this week. I think the main pain point is having a way to test selectors for a given page type and quickly see what broke.