rhoboro / splash_gcs

storing scrapy output directly in bucket #1

Open zinyosrim opened 6 years ago

zinyosrim commented 6 years ago

Meanwhile I managed to output the images to the bucket - thanks. But my scraped data is not stored in my_bucket:

# Configure item pipelines
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'gs://my_bucket/images_'
FILES_STORE = 'gs://my_bucket/files_'
GCS_PROJECT_ID = 'xxxx'

This is how I execute:

export GOOGLE_APPLICATION_CREDENTIALS=google-api-keys.json
scrapy crawl my_spider -o my_spider.json

Did I miss anything? Thanks Zin

rhoboro commented 6 years ago

But my scraped data is not stored in my_bucket:

What data do you want to store? If you want to use FILES_STORE, you need to enable the Files Pipeline. Please check here: https://scrapy.readthedocs.io/en/stable/topics/media-pipeline.html#enabling-your-media-pipeline
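For example, a settings.py fragment that enables both media pipelines might look like the sketch below. The pipeline paths and setting names come from the Scrapy docs; the bucket name and project ID are placeholders:

```python
# settings.py -- sketch enabling both the Images and Files pipelines.
# 'my_bucket' and the GCS_PROJECT_ID value are placeholders.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 2,  # needed for FILES_STORE to take effect
}
IMAGES_STORE = 'gs://my_bucket/images_'
FILES_STORE = 'gs://my_bucket/files_'
GCS_PROJECT_ID = 'xxxx'
```

Note this only stores files referenced in a `file_urls` item field; it does not export the items themselves.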

zinyosrim commented 6 years ago

I meant doing things like:

in settings.py: LOG_FILE = 'gs://my_bucket/scrapy_log.txt'
or: scrapy crawl my_spider -o gs://my_bucket/scrapy_items.json

This is possible with S3.

rhoboro commented 6 years ago

Sorry, that feature is called Feed exports, and GCS is not supported yet. https://github.com/scrapy/scrapy/issues/3044#issuecomment-352942342

If you want to export logs or feeds to a bucket, you need to write custom code. I think the S3 feed storage code is a good reference:

https://github.com/scrapy/scrapy/blob/dfe6d3d59aa3de7a96c1883d0f3f576ba5994aa9/scrapy/extensions/feedexport.py#L94
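A minimal sketch of a GCS backend modeled on that S3 code might look like this. The class name and the way the `gs://` URI is split are assumptions; the `google-cloud-storage` calls (`Client`, `bucket`, `blob`, `upload_from_file`) are from that library's public API, and the Scrapy import is guarded so the parsing logic can be read on its own:

```python
# Hypothetical GCS feed storage, modeled on Scrapy's S3FeedStorage.
from urllib.parse import urlparse

try:
    from scrapy.extensions.feedexport import BlockingFeedStorage
except ImportError:  # lets the sketch be read/run without Scrapy installed
    BlockingFeedStorage = object


class GCSFeedStorage(BlockingFeedStorage):
    """Uploads the finished feed file to a GCS bucket in a background thread."""

    def __init__(self, uri):
        # e.g. 'gs://my_bucket/scrapy_items.json'
        parsed = urlparse(uri)
        self.bucket_name = parsed.netloc        # 'my_bucket'
        self.blob_name = parsed.path.lstrip('/')  # 'scrapy_items.json'

    def _store_in_thread(self, file):
        # Deferred import: google-cloud-storage is an optional dependency.
        from google.cloud import storage
        file.seek(0)
        client = storage.Client()
        bucket = client.bucket(self.bucket_name)
        bucket.blob(self.blob_name).upload_from_file(file)
```

To wire it up you would register the scheme in settings.py, e.g. `FEED_STORAGES = {'gs': 'myproject.feedstorage.GCSFeedStorage'}` (the module path is a placeholder).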

zinyosrim commented 6 years ago

ok - thanks