scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/

[Feature Request] S3 Storage #477

Closed by mgrist 1 year ago

mgrist commented 1 year ago

It would be great if you could specify an S3 bucket URI in the scrapyd.conf file for the eggs_dir, logs_dir, items_dir, etc.
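For illustration, the idea would be something like the following (hypothetical: Scrapyd does not currently accept s3:// URIs for these options, and the bucket name is a placeholder):

```ini
# scrapyd.conf - hypothetical sketch of the requested feature;
# Scrapyd does NOT currently accept s3:// values for these options
[scrapyd]
eggs_dir  = s3://my-bucket/scrapyd/eggs
logs_dir  = s3://my-bucket/scrapyd/logs
items_dir = s3://my-bucket/scrapyd/items
```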

jpmckinney commented 1 year ago

What would be the expected behavior?

Scrapy can write lines to log files, items files, etc. at a very high frequency, and over a very long period of time, so it would not make sense to store these on S3 while the files are "open". Once a file is closed, it could perhaps be transferred to S3, but that is something you can do as a separate job (like a backup script); it's not clear that it should be something Scrapyd is responsible for.
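As a rough sketch of such a backup job (not part of Scrapyd; the bucket name and logs path are assumptions), something like this could upload closed log files with boto3:

```python
# upload_logs.py - standalone backup job, run after crawls finish (e.g. via cron).
# Assumes Scrapyd's default logs layout (logs/<project>/<spider>/<job>.log)
# and a bucket named "my-scrapyd-logs"; both are placeholders.
from pathlib import Path

import boto3

LOGS_DIR = Path("logs")
BUCKET = "my-scrapyd-logs"

def upload_closed_logs():
    s3 = boto3.client("s3")
    for log_file in LOGS_DIR.glob("*/*/*.log"):
        key = log_file.relative_to(LOGS_DIR).as_posix()
        s3.upload_file(str(log_file), BUCKET, key)
        # only safe once the job that wrote this file has exited
        log_file.unlink()  # free the local (temporary) storage

if __name__ == "__main__":
    upload_closed_logs()
```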

I assume this need arises from attempting to run Scrapyd on a host with only temporary storage (like Heroku). To get it working on such a platform:

- Leave items_dir empty in scrapyd.conf and use Scrapy's feed exports (the FEEDS setting) to write items directly to S3, which Scrapy supports natively (it requires botocore); see the sketch below.
- Leave logs_dir empty to disable local log storage, or ship closed log files to S3 with a separate backup job as described above.
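A minimal sketch of that setup, assuming a placeholder bucket name:

```ini
# scrapyd.conf - leave these empty so Scrapyd stores nothing locally
[scrapyd]
items_dir =
logs_dir  =
```

```python
# settings.py - write items straight to S3 via Scrapy's feed exports
FEEDS = {
    "s3://my-bucket/items/%(name)s/%(time)s.jl": {  # bucket is a placeholder
        "format": "jsonlines",
    },
}
# Credentials for Scrapy's S3 feed storage (alternatively, rely on the
# usual botocore credential chain, e.g. env vars or an IAM role)
AWS_ACCESS_KEY_ID = "..."
AWS_SECRET_ACCESS_KEY = "..."
```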

jpmckinney commented 1 year ago

I've made a commit to clarify the above in the docs: e92edd6

If you implement a new egg storage option, please feel free to open a pull request.
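For anyone attempting that, here is a bare-bones sketch of an S3-backed egg storage. Method names follow Scrapyd's IEggStorage interface (put/get/list/delete), but check the interface and how storages are registered in the Scrapyd version you run; the bucket name is a placeholder:

```python
# s3_eggstorage.py - hedged sketch of an S3-backed egg storage for Scrapyd.
# Assumes the scrapyd.interfaces.IEggStorage method set; verify against
# your Scrapyd version before relying on this.
from io import BytesIO

import boto3

class S3EggStorage:
    def __init__(self, config):
        self.bucket = "my-scrapyd-eggs"  # placeholder bucket name
        self.s3 = boto3.client("s3")

    def put(self, eggfile, project, version):
        # store the uploaded egg under <project>/<version>.egg
        self.s3.upload_fileobj(eggfile, self.bucket, f"{project}/{version}.egg")

    def get(self, project, version=None):
        # return (version, file-like) for the requested or latest version
        if version is None:
            versions = self.list(project)
            if not versions:
                return None, None
            version = versions[-1]
        body = self.s3.get_object(
            Bucket=self.bucket, Key=f"{project}/{version}.egg"
        )["Body"]
        return version, BytesIO(body.read())

    def list(self, project):
        # list stored versions for a project, sorted lexicographically
        response = self.s3.list_objects_v2(
            Bucket=self.bucket, Prefix=f"{project}/"
        )
        return sorted(
            obj["Key"].split("/")[-1].removesuffix(".egg")
            for obj in response.get("Contents", [])
        )

    def delete(self, project, version=None):
        # delete one version, or all versions if none is given
        if version is None:
            for v in self.list(project):
                self.delete(project, v)
        else:
            self.s3.delete_object(
                Bucket=self.bucket, Key=f"{project}/{version}.egg"
            )
```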

mgrist commented 1 year ago

@jpmckinney Thanks for the information! I wasn't aware of some of those features. You are correct that I am running Scrapyd on temporary storage. I wrote my own separate script that uploads the logs to S3 after each run, and I agree that this isn't Scrapyd's job. Thanks again for the insight.