Open wvengen opened 6 days ago
@wvengen On the following point:
"What happens if wacz_files
is empty? You added a check here, curious what would happen. Same actually a bit above, when a single file could not be opened."
I was looking for a way to raise an exception, similar to the NotConfigured
exception in the init method when the setting is not set. However, I did not find a proper way to handle this, mainly because we have this logic in the signal instead of in the init.
We currently check in the middleware if self.wacz
is set with hasattr(obj, 'wacz')
and have some logic for it, however this can definitely be improved. Will continue to look for an alternative solution here!
Adding co-authors to old commits does not seem to be straight forward. Would involve quite a bit of re-writing history. Let's see if we can add co-authors in a different way!
Yes, please re-write history. That is fine at this stage of the project (but preferably not after it has been released).
Added @wvengen to the co-authors for the commit where we implement the initial donwload/spider middleware (https://github.com/q-m/scrapy-webarchive/commit/0323aeb1b27c8b035c92b39e458b6309b38aef04). As far as I could see to only contributer to this functionality. If I have missed anyone please let me know!
Thanks for the project setup! It's looking quite nice already. Really happy that you've added some tests as well!! (Though I'd like to look at their completeness at another time.) After a first code review, I noticed the following points:
[x] The amount of in-code documentation quite varies. e.g.
downloadermiddlewares.py
has class documentation, butmiddlewares.py
not. I think it would be helpful to have it in both (or even point to a documentation file if you like to avoid duplication).[x]
middleware
vs.downloadermiddlewares
- both contain multiple middlewares, but one is singular, the other is plural.[x] The log could distinguish between opening a single WACZ vs. multiple ones (can be helpful when debugging) https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L40-L57
[x] Now that we've settled more on a project name, it may be good to bring statistics naming in-line with it, e.g.
sw/
orwebarchive/
(also on other places). https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L88[x] Probably something to fix later, but ideally tests work without network access. Would require starting a local webserver, probably. https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/tests/test_extensions.py#L50
[x] This looks like you may have copied it from somewhere. Please document if so (e.g. provide permalink), so that when e.g. Scrapy changes these internals a bit, it's easy to find what needs to change. (But perhaps #1 would change this anyway.) https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/extensions.py#L60-L77
[x] Usage doc: could help to make clearer that when using
WaczCrawlMiddleware
you also needWaczMiddleware
. So consider the Lookup and Iterating sections to be kind of self-contained examples.[x] Did you use much code from the crawling-part we already had in another project? May be nice to add co-authored-by to these commits.
[x] What happens if
wacz_files
is empty? You added a check here, curious what would happen. Same actually a bit above, when a single file could not be opened. https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L59-L60