Closed LostInDarkMath closed 6 months ago
Sorry for the inconvenience. Could you please try to clone this project and build it manually? This has been fixed by #93, but I'm a bit too busy to build and release a new package.
I don't have the time to build it either. I just wanted to quickly test your library to see if it was suitable for my use case. And I'm probably not the only one having this problem either. Should all users now build this manually?
Sorry again for the inconvenience. I have updated the package on PyPI.
It works now! Thank you very much :)
I have the same problem. Works with Bing and Baidu, but does not work with Google. I keep getting the following errors:
2022-07-27 18:52:22,851 - INFO - icrawler.crawler - start crawling...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 parser threads...
2022-07-27 18:52:22,853 - INFO - icrawler.crawler - starting 4 downloader threads...
2022-07-27 18:52:23,323 - INFO - parser - parsing result page https://www.google.com/search?q=cat&ijn=0&start=0&tbs=isz%3Al%2Cic%3Aspecific%2Cisc%3Aorange%2Csur%3Afmc%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2017%2Ccd_max%3A11%2F30%2F2017&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "C:\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python310\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2022-07-27 18:52:27,857 - INFO - downloader - no more download task for thread downloader-001
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-001 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-003
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-003 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-004
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-004 exit
2022-07-27 18:52:27,859 - INFO - downloader - no more download task for thread downloader-002
2022-07-27 18:52:27,859 - INFO - downloader - thread downloader-002 exit
2022-07-27 18:52:27,894 - INFO - icrawler.crawler - Crawling task done!
This is not related to this issue; it looks like https://github.com/hellock/icrawler/issues/107
Seems this problem is back.
Any solution for this problem?
Looks like some, or even many, website hosts are identifying bots and asking for human validation, which causes the problem.
Change the code like this; it helped me. File: ....\site-packages\icrawler\parser.py
uris = re.findall(r"http[^\[]*?\.(?:jpg|png|bmp)", txt)
uris = [bytes(uri, 'utf-8').decode('unicode-escape') for uri in uris]
if uris:
    return [{"file_url": uri} for uri in uris]
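The effect of those lines can be tried in isolation. The sketch below is only an illustration: the `txt` value is a made-up page fragment (not real Google output), and the regex is written with the bracket and the dot escaped, which is an assumption about what the snippet intends. It shows the pattern picking out an image URL and the `unicode-escape` step turning `\u003d` back into `=`.

```python
import re

# Hypothetical page fragment: the "=" in the URL is stored as the escape \u003d.
txt = r'["https://example.com/a\u003db/cat.jpg", 800, 600]'

# Lazily match from "http" up to the first .jpg/.png/.bmp, never crossing a "[".
uris = re.findall(r"http[^\[]*?\.(?:jpg|png|bmp)", txt)

# Turn literal escape sequences such as \u003d into the characters they stand for.
uris = [bytes(uri, "utf-8").decode("unicode-escape") for uri in uris]

print(uris)
```

Note that the snippet only returns when `uris` is non-empty, so a page with no matching URLs still falls through to whatever code follows it.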
Would you mind submitting a PR?
@ZhiyuanChen sorry I was wrong. @OxFF00FF's change did fix it, but I didn't reload the module properly. Now google crawler works.
I still don't understand why @OxFF00FF's change works. Can anyone explain? Both before and after the change, a list of uris is returned from parse, whether the URLs are decoded or not. Why does the error complain about the result of parse being None? uris should still be an iterable list before the change.
Thanks.
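One possible explanation, offered as a sketch rather than a reading of the actual icrawler source: if parse only returns inside an `if uris:` branch, then a page on which the pattern matches nothing makes the function fall off the end and return None implicitly, and the `for task in self.parse(...)` loop from the traceback then fails. The `parse` function and both input strings below are hypothetical stand-ins.

```python
import re


def parse(txt):
    """Hypothetical stand-in for a parser method; not the icrawler source."""
    uris = re.findall(r"http[^\[]*?\.(?:jpg|png|bmp)", txt)
    if uris:
        return [{"file_url": uri} for uri in uris]
    # No explicit return on this path, so Python returns None when nothing matched.


matched = parse('["http://example.com/cat.jpg"]')
unmatched = parse("<html>no image links here</html>")

try:
    for task in unmatched:  # same shape as the loop in the traceback
        pass
    error = None
except TypeError as exc:
    error = str(exc)

print(matched)
print(unmatched)
print(error)
```

Under that assumption, the change would help not because the returned list is decoded differently, but because the new pattern finds matches where the old one found none, so the None path is no longer taken.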
Please let me know if 0.6.8 fixes this issue~
Hi there, I just tried out your library, but unfortunately, I get an error:
But I just use your example code:
How can I fix the problem?