It turns out that refactoring the WebsiteData structure and equipping it with some caching functionality automatically resolves the concurrency issue.
This is nice for a couple of reasons:
future-proof: more properties can be added to the new Content class (formerly known as WebsiteData).
allows moving expensive, shared content calculations (such as raw_links) into a ProcessPool to reduce CPU load in the main event loop.
works towards individual endpoints for the individual extractors (all that is missing is some "inter-request" caching of the Content objects).
lets us stick with a simple Extractor API.
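To illustrate the idea behind the points above, here is a minimal sketch of a content object that computes a shared value once and lets concurrent callers await the same result. It is not the code from this PR: the constructor signature, the `extract_raw_links` helper, the `computations` counter and the `main` function are assumptions made up for the example. Only the names `Content` and `raw_links` come from the description.

```python
import asyncio
import re
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import List, Optional


def extract_raw_links(html: str) -> List[str]:
    # Hypothetical stand-in for an expensive, CPU-bound calculation.
    # It lives at module level so a process pool can pickle it.
    return re.findall(r'href="([^"]+)"', html)


class Content:
    """Illustrative sketch only; not the actual class from this PR."""

    def __init__(self, html: str, executor: Optional[Executor] = None):
        self.html = html
        # None falls back to the event loop's default thread pool.
        self._executor = executor
        self._raw_links_task: Optional["asyncio.Task[List[str]]"] = None
        self.computations = 0  # counts how often the work is actually started

    async def _compute_raw_links(self) -> List[str]:
        self.computations += 1
        loop = asyncio.get_running_loop()
        # Offload the calculation so the event loop stays responsive.
        return await loop.run_in_executor(
            self._executor, extract_raw_links, self.html
        )

    async def raw_links(self) -> List[str]:
        # Cache the task rather than the result: extractors that ask while
        # the calculation is still running await the same task instead of
        # starting a second calculation.
        if self._raw_links_task is None:
            self._raw_links_task = asyncio.create_task(self._compute_raw_links())
        return await self._raw_links_task


async def main() -> None:
    # Three "extractors" request the same shared value concurrently.
    with ProcessPoolExecutor(max_workers=1) as pool:
        content = Content('<a href="https://example.com/a">a</a>', executor=pool)
        await asyncio.gather(*(content.raw_links() for _ in range(3)))
        print(content.computations)
```

Calling `asyncio.run(main())` under an `if __name__ == "__main__":` guard runs the sketch with a process pool. Because the cached object is the task itself, no explicit lock is needed between extractors, which is one plausible reading of why caching also takes care of the concurrency issue.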
TODO:
[x] finish exception handling for failed Splash requests.
[x] fix tests (they are probably all broken after the refactoring)
Resolves #149