hamzashah47 opened this issue 2 years ago
Hi, I have the same issue. Have you found a workaround?
Yes, I fixed it. I created a fork and wrapped this code chunk in a try/except block. It worked for me.
I installed comcrawl today with pip install comcrawl, and ran:
from comcrawl import IndexClient
client = IndexClient(["2019-51", "2019-47"], verbose=True)
client.download()
Is there anything I can dig into?
Create a fork of the original repo and add a try/except in the client.download() method. Or I can send you my module if you share your email.
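For anyone who can't wait for that fork, the idea behind the fix can be sketched generically. Note that download_all and fetch below are hypothetical stand-ins for comcrawl's internals, not its actual API:

```python
def download_all(records, fetch):
    """Download every search result, skipping records that raise.

    `records` is the list of index hits and `fetch` is whatever function
    downloads one record (both hypothetical stand-ins for comcrawl
    internals). One failing record no longer aborts the whole run.
    """
    pages, failures = [], []
    for record in records:
        try:
            pages.append(fetch(record))
        except Exception as exc:  # e.g. gzip or connection errors
            failures.append((record, exc))
    return pages, failures
```

The key point is collecting failures instead of letting the first bad record raise out of the loop, so you can inspect or retry them afterwards.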
When I use "CC-MAIN-2022-33" as the index, like below,
from comcrawl import IndexClient
client = IndexClient(["CC-MAIN-2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()
client.results
I did not get an error, but client.results is [].
When I use "2022-33" as the index, like below,
from comcrawl import IndexClient
client = IndexClient(["2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()
I got an error. I'm not sure how to set the index correctly.
Thanks in advance.
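Judging from the working "2019-51" examples earlier in the thread, the client appears to expect the short index form without the "CC-MAIN-" prefix (an assumption, not confirmed by the library docs). A sketch that derives the short ids from Common Crawl's public collection listing at https://index.commoncrawl.org/collinfo.json:

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def short_index_id(full_id: str) -> str:
    """Convert e.g. 'CC-MAIN-2022-33' to '2022-33', the form the
    IndexClient appears to expect (an assumption based on the working
    '2019-51' examples above). Already-short ids pass through unchanged.
    """
    prefix = "CC-MAIN-"
    return full_id[len(prefix):] if full_id.startswith(prefix) else full_id


def list_index_ids():
    """Fetch Common Crawl's collection listing and return short ids."""
    with urlopen(COLLINFO_URL) as resp:
        collections = json.load(resp)
    return [short_index_id(c["id"]) for c in collections]
```

Each entry in collinfo.json carries an "id" field like "CC-MAIN-2022-33", so this also lets you verify that the crawl you're asking for actually exists.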
I'm facing the same problem. I changed the URL_TEMPLATE here
to become URL_TEMPLATE = "https://data.commoncrawl.org/(unknown)"
following this announcement.
EDIT: It seems this PR did the same; I don't know why it hasn't been merged: https://github.com/michaelharms/comcrawl/pull/41
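For reference, the change amounts to swapping the old S3 host for the new data.commoncrawl.org host while leaving the rest of the template untouched. A minimal sketch of that rewrite as a pure function (the old host string is an assumption based on the announcement; check the actual constant in your installed comcrawl version):

```python
def patch_template(template: str) -> str:
    """Rewrite the old Common Crawl S3 host to the new download host.

    Assumes the old template pointed at commoncrawl.s3.amazonaws.com,
    per the migration announcement mentioned above. Everything after
    the host (the path/placeholder part) is preserved as-is.
    """
    return template.replace(
        "https://commoncrawl.s3.amazonaws.com",
        "https://data.commoncrawl.org",
    )
```

Applying this to the module-level URL_TEMPLATE (in a fork, or at runtime before downloading) is what the linked PR effectively does.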
I'm having issues and get the error below. Is there something I'm missing in my setup?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1368, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 317, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 286, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Nevermind, my issue was VPN related.
client = IndexClient(["2019-51", "2019-47"])
client.search("reddit.com/r/MachineLearning/*")
client.download()
I'm trying to download HTML pages, but it's not working. It gives the error: Not a gzipped file (b'<?').
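That error means the bytes coming back are not gzip data: a body starting with b'<?' is usually an XML/HTML error page from the server (for example, a 404 from the old S3 URL) rather than a compressed WARC record. One defensive way to handle it, as a sketch independent of comcrawl's internals:

```python
import gzip


def maybe_gunzip(raw: bytes) -> bytes:
    """Gunzip the payload if it is gzip data; otherwise return it as-is.

    A non-gzip body that begins with b'<?' is typically an XML/HTML
    error page, which is exactly what triggers the
    "Not a gzipped file (b'<?')" exception in the thread above.
    """
    try:
        return gzip.decompress(raw)
    except (gzip.BadGzipFile, OSError):
        # Not gzip: hand back the raw bytes so the caller can inspect
        # the error page instead of crashing mid-download.
        return raw
```

If maybe_gunzip hands you back something starting with "<?", inspect it: it is likely the server's error message, which points at the URL template issue discussed earlier in the thread.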