michaelharms / comcrawl

A Python utility for downloading Common Crawl data
https://github.com/michaelharms/comcrawl#readme
MIT License
220 stars 38 forks

client.download() not working. Gives error (Not a gzipped file (b'<?')). #40

Open hamzashah47 opened 2 years ago

hamzashah47 commented 2 years ago

client = IndexClient(["2019-51", "2019-47"])
client.search("reddit.com/r/MachineLearning/*")
client.download()

I'm trying to download HTML pages, but it's not working. It gives the error (Not a gzipped file (b'<?')).

kouohhashi commented 2 years ago

Hi, I have the same issue. Have you found a workaround?

sufyanel commented 2 years ago

Yes, I fixed it. I created a fork and wrapped this code chunk in a try/except block. It worked for me.
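(Editor's note: the actual chunk from the fork isn't shown above. As a hedged sketch of what such a try/except guard might look like: the b'<?' prefix in the error suggests the server returned an uncompressed XML error page instead of a gzipped WARC record, so the decompression step can fall back to the raw bytes. The function name is hypothetical, not comcrawl's API.)

```python
import gzip


def decompress_or_raw(raw_bytes: bytes) -> str:
    """Try to gunzip a downloaded record; fall back to the raw bytes.

    A response starting with b'<?' is likely an uncompressed XML error
    page from the server rather than a gzipped WARC record.
    """
    try:
        return gzip.decompress(raw_bytes).decode("utf-8", errors="replace")
    except (gzip.BadGzipFile, OSError):
        return raw_bytes.decode("utf-8", errors="replace")
```

Note that silently keeping the raw bytes hides the underlying problem (the server is refusing the request); logging the fallback would make the failure visible.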

kouohhashi commented 2 years ago

I installed comcrawl today by pip install comcrawl.

And I did

from comcrawl import IndexClient
client = IndexClient(["2019-51", "2019-47"], verbose=True)
client.download()

Is there anything I can dig into?

sufyanel commented 2 years ago

Create a fork from the original repo and add a try/except in the client.download() method. Or I can send you my module if you share your email.

kouohhashi commented 2 years ago

When I use "CC-MAIN-2022-33" as the index, like below,

from comcrawl import IndexClient
client = IndexClient(["CC-MAIN-2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()
client.results

I did not get an error but client.results is [].

When I use "2022-33" as the index, like below,

from comcrawl import IndexClient
client = IndexClient(["2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()

I got an error.

I'm not sure how to set index correctly.

Thanks in advance.
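(Editor's note: a possible explanation, based on an assumption about how comcrawl builds its search URL rather than anything confirmed by the maintainer: the client appears to prepend "CC-MAIN-" to whatever index string you pass. Passing the full "CC-MAIN-2022-33" would then double the prefix and match no index, returning empty results, while the short "2022-33" searches correctly but then hits the download error above. A sketch of the assumed behavior:)

```python
# The exact template is an assumption; it only illustrates why passing the
# full "CC-MAIN-..." id would double the prefix.
SEARCH_URL_TEMPLATE = (
    "https://index.commoncrawl.org/CC-MAIN-{index}-index"
    "?url={url}&output=json"
)


def build_search_url(index: str, url: str) -> str:
    """Build the index-server search URL the way comcrawl appears to."""
    return SEARCH_URL_TEMPLATE.format(index=index, url=url)
```

Under this assumption, build_search_url("CC-MAIN-2022-33", ...) produces a "CC-MAIN-CC-MAIN-2022-33-index" path that matches no crawl, which would explain the empty client.results.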

customer101 commented 1 year ago

I'm facing the same problem. I changed the URL_TEMPLATE here to URL_TEMPLATE = "https://data.commoncrawl.org/(unknown)", following this announcement.

EDIT: It seems this PR did the same; I don't know why it hasn't been merged: https://github.com/michaelharms/comcrawl/pull/41
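(Editor's note: for anyone who can't patch the library, the same host change can be applied to a URL string directly. A minimal sketch, where the legacy host name comes from Common Crawl's hosting migration announcement and the function name is hypothetical:)

```python
from urllib.parse import urlparse, urlunparse


def rewrite_to_new_host(url: str) -> str:
    """Point a legacy Common Crawl S3 URL at data.commoncrawl.org."""
    parts = urlparse(url)
    if parts.netloc == "commoncrawl.s3.amazonaws.com":
        # Only the host changes; the path to the WARC file stays the same.
        parts = parts._replace(netloc="data.commoncrawl.org")
    return urlunparse(parts)
```

This leaves non-matching URLs untouched, so it is safe to apply to every download URL.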

jaceaser commented 1 year ago

I'm having the same issue and get the error below. Is there something I'm missing in my setup?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1368, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 317, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 286, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

jaceaser commented 1 year ago

Nevermind, my issue was VPN related.