q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Specify the complete destination with `SW_EXPORT_URI` #18

Closed: leewesleyv closed this 3 weeks ago

leewesleyv commented 4 weeks ago

We want `SW_EXPORT_URI` to specify the complete destination, rather than using it as a prefix from which the destination is generated.

It would be nice if the invoking system could determine the final filename; otherwise it has to guess the resulting WACZ filename afterwards. Knowing the complete filename up front also makes it easier to find the resulting WACZ later.

_Originally posted by @wvengen in https://github.com/q-m/scrapy-webarchive/pull/17#discussion_r1819002662_

wvengen commented 4 weeks ago

Note that there could still be a use case for providing a path and having a sensible filename generated, but I would also like to be able to specify the full path to the file.

leewesleyv commented 4 weeks ago

Let's see if we can implement this using the current setting, or if it would be cleaner to introduce an additional setting.

wvengen commented 4 weeks ago

Good point.

wvengen commented 4 weeks ago

Perhaps we can keep it a single setting, where one would normally include variables to generate the filename. That would also make the filename more explicit.
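To illustrate the single-setting idea, here is a minimal sketch of how a template with variables could be expanded into a complete destination. The helper name `render_export_uri` and the placeholder names (`{spider}`, `{year}`, `{month}`, `{day}`, `{timestamp}`) are assumptions made for this example; they are not taken from the plugin's code.

```python
from datetime import datetime, timezone


def render_export_uri(template: str, spider_name: str, now: datetime) -> str:
    """Expand placeholders in an export URI template (hypothetical helper)."""
    return template.format(
        spider=spider_name,
        year=now.strftime("%Y"),
        month=now.strftime("%m"),
        day=now.strftime("%d"),
        timestamp=now.strftime("%Y%m%d%H%M%S"),
    )


# The bucket name and layout are made up for illustration.
uri = render_export_uri(
    "s3://bucket/archives/{spider}/{year}/{month}/{spider}-{timestamp}.wacz",
    "quotes",
    datetime(2024, 11, 5, 12, 30, 0, tzinfo=timezone.utc),
)
print(uri)
```

With a template like this, the invoking system can either pass a fixed filename (no placeholders) or let the variables produce a predictable one.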

leewesleyv commented 3 weeks ago

Code-wise it would have been a bit cleaner to split up the settings. However, from the user's perspective it makes more sense to combine the two and just define a single output location.

Feature implemented in a570295

wvengen commented 3 weeks ago

Super!

I didn't see this right away in the code (but it must be there), hence the question: how do you know whether the provided setting is a folder or a full path? I could imagine checking whether it ends with a `/`; in that case it needs to be documented well. Ah, now I see `isdir`. Does that work on object storage too?

I'm OK with not allowing a folder and always requiring a filename. If the example gives a suitable default, that would be fine: you need to configure it anyway, since there is no sensible default for a storage location.
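One way to sidestep the `isdir` question is to decide from the string alone, so the check behaves the same for local paths and object storage. The sketch below assumes forward-slash URIs and a trailing-`/` convention for folders; `looks_like_directory` and `require_wacz_filename` are hypothetical names, not functions from the plugin.

```python
import posixpath
from urllib.parse import urlparse


def looks_like_directory(uri: str) -> bool:
    """Guess from the string alone whether a URI names a folder.

    No storage backend is queried, so the answer does not depend on
    whether the target is a local disk or an object store.
    """
    path = urlparse(uri).path
    return path == "" or path.endswith("/")


def require_wacz_filename(uri: str) -> str:
    """Return the URI unchanged, or raise if it does not name a .wacz file."""
    if looks_like_directory(uri):
        raise ValueError(f"expected a full file path, got a folder: {uri!r}")
    filename = posixpath.basename(urlparse(uri).path)
    if not filename.endswith(".wacz"):
        raise ValueError(f"expected a .wacz filename, got: {filename!r}")
    return uri
```

Rejecting folders outright, as suggested above, keeps the rule simple: the setting always names a file, and a misconfigured value fails at startup rather than after the crawl.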