scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

GZipPlugin does not work with S3 #6289

Closed: masaez closed this issue 1 month ago

masaez commented 1 month ago

Description

Using GzipPlugin combined with S3 feed storage does not work. I believe it is related to this comment: https://github.com/scrapy/scrapy/issues/5928#issuecomment-1545835789. After configuring S3 and GzipPlugin like this:

from scrapy.extensions.postprocessing import GzipPlugin

FEEDS = {
    "s3://my-bucket/feeds/%(name)s/%(time)s.gz": {
        "format": "jsonlines",
        "postprocessing": [GzipPlugin],
        "gzip_compresslevel": 5,
    },
}
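As a side note, the output this configuration should produce is ordinary gzip-compressed JSON lines. A stand-alone round trip using only the standard library (no Scrapy; the items are made up for illustration) shows what the `"jsonlines"` format plus gzip at compresslevel 5 amounts to:

```python
import gzip
import io
import json

# Write two sample items as gzip-compressed JSON lines, mirroring what
# "format": "jsonlines" + GzipPlugin with gzip_compresslevel 5 produces.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=5) as gz:
    for item in [{"text": "quote one"}, {"text": "quote two"}]:
        gz.write((json.dumps(item) + "\n").encode("utf-8"))

# Read the feed back line by line, as a consumer of the .gz file would.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as gz:
    items = [json.loads(line) for line in gz]

print(items)  # -> [{'text': 'quote one'}, {'text': 'quote two'}]
```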

I get the following error:

2024-03-11 18:20:48 [scrapy.extensions.feedexport] ERROR: Error storing jsonlines feed (110 items) in: s3://my-bucket/feeds/quotes/2024-03-11T21-20-44+00-00.gz
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 269, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/threadpool.py", line 285, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 117, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/twisted/python/context.py", line 82, in callWithContext
    return func(*args, **kw)
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/extensions/feedexport.py", line 244, in _store_in_thread
    file.seek(0)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
ValueError: seek of closed file
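For context on the bottom frame: once a Python file object is closed, any seek on it raises exactly this ValueError. The S3 storage rewinds the temporary feed file with file.seek(0) before uploading, so if a post-processing step has already closed that file, the rewind fails. A minimal stdlib-only sketch of the failure mode (no Scrapy involved; the early close here just stands in for whatever closed the temp file):

```python
import tempfile

# The S3 feed storage writes items to a temporary file, then seeks back
# to the start before uploading its contents.
tmp = tempfile.NamedTemporaryFile()
tmp.write(b'{"text": "example item"}\n')
tmp.close()  # stands in for the file being closed prematurely

try:
    tmp.seek(0)  # what _store_in_thread does before uploading
except ValueError as e:
    print(e)  # -> seek of closed file
```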

Versions

Scrapy       : 2.11.1
lxml         : 4.9.2.0
libxml2      : 2.9.4
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 24.3.0
Python       : 3.11.3 (main, Apr 7 2023, 19:29:16) [Clang 14.0.0 (clang-1400.0.29.202)]
pyOpenSSL    : 24.1.0 (OpenSSL 3.2.1 30 Jan 2024)
cryptography : 42.0.5
Platform     : macOS-12.7.1-x86_64-i386-64bit

Gallaecio commented 1 month ago

I believe this is a duplicate of https://github.com/scrapy/scrapy/issues/5932, which has been fixed in the main branch but not released yet. Could you confirm that it works if you install Scrapy from the main branch (i.e. pip install git+https://github.com/scrapy/scrapy.git)?

masaez commented 1 month ago

Thank you @Gallaecio for your response! Yes, it did work. What is the usual release cycle? I would need this available on Zyte, and I don't think I can configure scrapinghub.yml to use the main branch.

Gallaecio commented 1 month ago

What is the usual release cycle?

There is no fixed period; we usually wait until we have a few big features merged. I don't expect 2.12 to be released soon, probably not sooner than 2 months from now, but that's just a guess; it could be a bit sooner or much later.

I don't think I can configure scrapinghub.yml to use the main branch.

You can: add Scrapy to your requirements.txt file (e.g. Scrapy @ git+https://github.com/scrapy/scrapy.git@6fc78270427c41e401a01a46551d27dd4ddf846c), and it will replace the default version bundled with the selected stack. Using the latest stack here would be best.
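For what it's worth, a sketch of what the two files could look like for Scrapy Cloud, assuming the standard shub project layout (the stack name is illustrative; the pinned commit is the one quoted above):

```yaml
# scrapinghub.yml — point the project at a recent stack and a
# requirements file (assumed shub configuration keys)
stacks:
  default: scrapy:2.11  # pick the latest stack available to you
requirements:
  file: requirements.txt

# requirements.txt would then contain a single line pinning Scrapy to
# the main-branch commit, replacing the stack's bundled version:
#   Scrapy @ git+https://github.com/scrapy/scrapy.git@6fc78270427c41e401a01a46551d27dd4ddf846c
```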