scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/

[Feature Request] S3 Storage #477

Closed by mgrist 1 year ago

mgrist commented 1 year ago

It would be great if you could specify an S3 bucket URI in the scrapyd.conf file for the eggs_dir, logs_dir, items_dir, etc.
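For illustration, the idea would be something like the following (hypothetical: Scrapyd does not currently accept s3:// URIs for these options, and the bucket name is a placeholder):

```ini
# scrapyd.conf - hypothetical sketch of the requested feature;
# Scrapyd does NOT currently accept s3:// values for these options
[scrapyd]
eggs_dir  = s3://my-bucket/scrapyd/eggs
logs_dir  = s3://my-bucket/scrapyd/logs
items_dir = s3://my-bucket/scrapyd/items
```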

jpmckinney commented 1 year ago

What would be the expected behavior?

Scrapy can write lines to log files, items files, etc. at a very high frequency, and over a very long period of time, so it would not make sense to store these on S3 while the files are "open". Once a file is closed, it could perhaps be transferred to S3, but that is something you can do as a separate job (like a backup script); it's not clear that it should be something Scrapyd is responsible for.
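As a rough sketch of such a backup job (not part of Scrapyd; the bucket name and logs path are assumptions), something like this could upload closed log files with boto3:

```python
# upload_logs.py - standalone backup job, run after crawls finish (e.g. via cron).
# Assumes Scrapyd's default logs layout (logs/<project>/<spider>/<job>.log)
# and a bucket named "my-scrapyd-logs"; both are placeholders.
from pathlib import Path

import boto3

LOGS_DIR = Path("logs")
BUCKET = "my-scrapyd-logs"

def upload_closed_logs():
    s3 = boto3.client("s3")
    for log_file in LOGS_DIR.glob("*/*/*.log"):
        key = log_file.relative_to(LOGS_DIR).as_posix()
        s3.upload_file(str(log_file), BUCKET, key)
        # only safe once the job that wrote this file has exited
        log_file.unlink()  # free the local (temporary) storage

if __name__ == "__main__":
    upload_closed_logs()
```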

I assume this need arises from attempting to run Scrapyd on a host with only temporary storage (like Heroku). To get it working on such a platform:

- Leave items_dir empty in scrapyd.conf and use Scrapy's feed exports (the FEEDS setting) to write items directly to S3, which Scrapy supports natively (it requires botocore); see the sketch below.
- Leave logs_dir empty to disable local log storage, or ship closed log files to S3 with a separate backup job as described above.
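A minimal sketch of that setup, assuming a placeholder bucket name:

```ini
# scrapyd.conf - leave these empty so Scrapyd stores nothing locally
[scrapyd]
items_dir =
logs_dir  =
```

```python
# settings.py - write items straight to S3 via Scrapy's feed exports
FEEDS = {
    "s3://my-bucket/items/%(name)s/%(time)s.jl": {  # bucket is a placeholder
        "format": "jsonlines",
    },
}
# Credentials for Scrapy's S3 feed storage (alternatively, rely on the
# usual botocore credential chain, e.g. env vars or an IAM role)
AWS_ACCESS_KEY_ID = "..."
AWS_SECRET_ACCESS_KEY = "..."
```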

jpmckinney commented 1 year ago

I've made a commit to clarify the above in the docs: e92edd6

If you implement a new egg storage option, please feel free to open a pull request.
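For anyone attempting that, here is a bare-bones sketch of an S3-backed egg storage. Method names follow Scrapyd's IEggStorage interface (put/get/list/delete), but check the interface and how storages are registered in the Scrapyd version you run; the bucket name is a placeholder:

```python
# s3_eggstorage.py - hedged sketch of an S3-backed egg storage for Scrapyd.
# Assumes the scrapyd.interfaces.IEggStorage method set; verify against
# your Scrapyd version before relying on this.
from io import BytesIO

import boto3

class S3EggStorage:
    def __init__(self, config):
        self.bucket = "my-scrapyd-eggs"  # placeholder bucket name
        self.s3 = boto3.client("s3")

    def put(self, eggfile, project, version):
        # store the uploaded egg under <project>/<version>.egg
        self.s3.upload_fileobj(eggfile, self.bucket, f"{project}/{version}.egg")

    def get(self, project, version=None):
        # return (version, file-like) for the requested or latest version
        if version is None:
            versions = self.list(project)
            if not versions:
                return None, None
            version = versions[-1]
        body = self.s3.get_object(
            Bucket=self.bucket, Key=f"{project}/{version}.egg"
        )["Body"]
        return version, BytesIO(body.read())

    def list(self, project):
        # list stored versions for a project, sorted lexicographically
        response = self.s3.list_objects_v2(
            Bucket=self.bucket, Prefix=f"{project}/"
        )
        return sorted(
            obj["Key"].split("/")[-1].removesuffix(".egg")
            for obj in response.get("Contents", [])
        )

    def delete(self, project, version=None):
        # delete one version, or all versions if none is given
        if version is None:
            for v in self.list(project):
                self.delete(project, v)
        else:
            self.s3.delete_object(
                Bucket=self.bucket, Key=f"{project}/{version}.egg"
            )
```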

mgrist commented 1 year ago

@jpmckinney Thanks for the information! I wasn't aware of some of those features. You are correct that I am running Scrapyd on temporary storage. I wrote my own separate script that uploads the logs to S3 after each run, and I agree that this isn't Scrapyd's job. Thanks again for the insight.