q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Initial code-review #5

Open wvengen opened 6 days ago

wvengen commented 6 days ago

Thanks for the project setup! It's looking quite nice already. Really happy that you've added some tests as well!! (Though I'd like to look at their completeness at another time.) After a first code review, I noticed the following points:

leewesleyv commented 5 days ago

@wvengen On the following point:

"What happens if wacz_files is empty? You added a check here, curious what would happen. Same actually a bit above, when a single file could not be opened."

I was looking for a way to raise an exception, similar to the NotConfigured exception in the init method when the setting is not set. However, I did not find a proper way to do this, mainly because the logic lives in the signal handler instead of in the init.

We currently check in the middleware whether self.wacz is set using hasattr(obj, 'wacz') and branch on that, but this can definitely be improved. I will keep looking for an alternative solution here!
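For context, here is a minimal sketch of the two patterns being discussed. The setting name WACZ_SOURCE_URI and the open_wacz helper are hypothetical placeholders, not the plugin's real API. The key constraint is that NotConfigured only disables a component when raised during construction (from_crawler / __init__); once a signal handler fires, the crawl is already running, so a failure there has to be handled another way, e.g. the hasattr guard mentioned above:

```python
from scrapy import signals
from scrapy.exceptions import NotConfigured


class WaczMiddleware:
    """Sketch of the pattern under discussion; names are illustrative."""

    def __init__(self, settings):
        self.wacz_uri = settings.get("WACZ_SOURCE_URI")  # hypothetical setting
        if not self.wacz_uri:
            # Raising NotConfigured during construction cleanly disables
            # the middleware -- this is what works in the init.
            raise NotConfigured("WACZ_SOURCE_URI is not set")

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls(crawler.settings)
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        # By the time this signal fires, NotConfigured has no effect:
        # the crawler is already running. If the archive cannot be opened,
        # all we can do is log the failure and leave self.wacz unset.
        try:
            self.wacz = open_wacz(self.wacz_uri)  # hypothetical helper
        except OSError:
            spider.logger.error("Could not open WACZ at %s", self.wacz_uri)

    def process_request(self, request, spider):
        # The hasattr guard mentioned above: skip archive lookups when
        # spider_opened failed to set self.wacz.
        if not hasattr(self, "wacz"):
            return None  # fall through to the normal downloader
        # ... look up the request in self.wacz here ...
```

One way out, hinted at above, would be to open (or at least validate) the archive in from_crawler, where NotConfigured still takes effect; the trade-off is doing I/O before the crawl starts.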

leewesleyv commented 2 days ago

Adding co-authors to old commits does not seem to be straightforward; it would involve quite a bit of rewriting history (see the sketch below). Let's see if we can add co-authors in a different way!
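For reference, GitHub attributes co-authors via a `Co-authored-by:` trailer in the commit message, so adding one to an existing commit means amending it during an interactive rebase, roughly as follows (name and email are placeholders; `--trailer` requires Git 2.32+):

```sh
git rebase -i 0323aeb^        # mark the target commit as "edit" in the todo list
git commit --amend --no-edit \
    --trailer "Co-authored-by: Name <name@example.com>"
git rebase --continue
git push --force-with-lease   # history has changed, so a forced push is needed
```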

wvengen commented 2 days ago

Yes, please rewrite history. That is fine at this stage of the project (but preferably not after it has been released).

leewesleyv commented 2 days ago

Added @wvengen as a co-author on the commit where we implemented the initial download/spider middleware (https://github.com/q-m/scrapy-webarchive/commit/0323aeb1b27c8b035c92b39e458b6309b38aef04). As far as I could see, you were the only contributor to this functionality. If I have missed anyone, please let me know!