Initial code-review - Githubissues

wvengen commented 6 days ago

Thanks for the project setup! It's looking quite nice already. Really happy that you've added some tests as well!! (Though I'd like to look at their completeness at another time.) After a first code review, I noticed the following points:

[x] The amount of in-code documentation quite varies. e.g. downloadermiddlewares.py has class documentation, but middlewares.py not. I think it would be helpful to have it in both (or even point to a documentation file if you like to avoid duplication).
[x] middleware vs. downloadermiddlewares - both contain multiple middlewares, but one is singular, the other is plural.
[x] The log could distinguish between opening a single WACZ vs. multiple ones (can be helpful when debugging) https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L40-L57
[x] Now that we've settled more on a project name, it may be good to bring statistics naming in-line with it, e.g. sw/ or webarchive/ (also on other places). https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L88
[x] Probably something to fix later, but ideally tests work without network access. Would require starting a local webserver, probably. https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/tests/test_extensions.py#L50
[x] This looks like you may have copied it from somewhere. Please document if so (e.g. provide permalink), so that when e.g. Scrapy changes these internals a bit, it's easy to find what needs to change. (But perhaps #1 would change this anyway.) https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/extensions.py#L60-L77
[x] Usage doc: could help to make clearer that when using WaczCrawlMiddleware you also need WaczMiddleware. So consider the Lookup and Iterating sections to be kind of self-contained examples.
[x] Did you use much code from the crawling-part we already had in another project? May be nice to add co-authored-by to these commits.
[x] What happens if wacz_files is empty? You added a check here, curious what would happen. Same actually a bit above, when a single file could not be opened. https://github.com/q-m/scrapy-webarchive/blob/a22c34f4b842e0f1c0b9c4f395924aaadedf19b3/scrapy_webarchive/middleware.py#L59-L60

leewesleyv commented 5 days ago

@wvengen On the following point:

"What happens if wacz_files is empty? You added a check here, curious what would happen. Same actually a bit above, when a single file could not be opened."

I was looking for a way to raise an exception, similar to the NotConfigured exception in the init method when the setting is not set. However, I did not find a proper way to handle this, mainly because we have this logic in the signal instead of in the init.

We currently check in the middleware if self.wacz is set with hasattr(obj, 'wacz') and have some logic for it, however this can definitely be improved. Will continue to look for an alternative solution here!

leewesleyv commented 2 days ago

Adding co-authors to old commits does not seem to be straight forward. Would involve quite a bit of re-writing history. Let's see if we can add co-authors in a different way!

wvengen commented 2 days ago

Yes, please re-write history. That is fine at this stage of the project (but preferably not after it has been released).

leewesleyv commented 2 days ago

Added @wvengen to the co-authors for the commit where we implement the initial donwload/spider middleware (https://github.com/q-m/scrapy-webarchive/commit/0323aeb1b27c8b035c92b39e458b6309b38aef04). As far as I could see to only contributer to this functionality. If I have missed anyone please let me know!

q-m / scrapy-webarchive

Initial code-review #5