hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License

TypeError: 'NoneType' object is not iterable #107

Open sgttwld opened 2 years ago

sgttwld commented 2 years ago

For me, the GoogleImageCrawler from icrawler doesn't work anymore. I updated the user agent in crawler.py, since that seemed to help in the past, but no luck here. I tried it on both Python 3.8 and 3.9 (Apple Silicon, but that shouldn't matter). Again, it worked in the past (about 3-6 months ago).

Even the simple example

from icrawler.builtin import GoogleImageCrawler
searchterm = 'ANY SEARCHTERM'
google_crawler = GoogleImageCrawler(storage={'root_dir': 'test'})
google_crawler.crawl(keyword=searchterm, max_num=1)

gives

...lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File ".../python3.8/site-packages/icrawler-0.6.6-py3.8.egg/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable

Does anyone know how to fix this, or does anyone else have the same issue in July 2022?

Kir-1 commented 2 years ago

I have the same problem. I tried to roll back to an older version, but that does not help. I saw that this library has had this problem before. I am executing the following code (with self.__search_word = 'cat' and self.__count = 10):

google = GoogleImageCrawler(storage={"root_dir": path})
filters = dict(size='>1024x768', date=((2020, 1, 1), (2021, 11, 30)))
try:
    google.crawl(keyword=self.__search_word, max_num=self.__count,
                 filters=filters, offset=rnd.randint(0, 500))
except Exception as _ex:
    logger.error("Something happened when uploading images", _ex)

At the output I get:

2022-07-10 10:36:53,907 - INFO - icrawler.crawler - start crawling...
2022-07-10 10:36:53,907 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-07-10 10:36:53,914 - INFO - feeder - thread feeder-001 exit
2022-07-10 10:36:53,915 - INFO - icrawler.crawler - starting 1 parser threads...
2022-07-10 10:36:53,916 - INFO - icrawler.crawler - starting 1 downloader threads...
2022-07-10 10:36:54,117 - INFO - parser - parsing result page https://www.google.com/search?q=apex&ijn=1&start=150&tbs=isz%3Alt%2Cislt%3Axga%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2020%2Ccd_max%3A11%2F30%2F2021&tbm=isch
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Administrator\PycharmProjects\BPG\venv37\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
python-BaseException

Viachaslau85 commented 2 years ago

I have the same problem. Works with Bing and Baidu, but does not work with Google. I keep getting the following errors:

2022-07-27 18:52:22,851 - INFO - icrawler.crawler - start crawling...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-07-27 18:52:22,852 - INFO - icrawler.crawler - starting 1 parser threads...
2022-07-27 18:52:22,853 - INFO - icrawler.crawler - starting 4 downloader threads...
2022-07-27 18:52:23,323 - INFO - parser - parsing result page https://www.google.com/search?q=cat&ijn=0&start=0&tbs=isz%3Al%2Cic%3Aspecific%2Cisc%3Aorange%2Csur%3Afmc%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2017%2Ccd_max%3A11%2F30%2F2017&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "C:\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python310\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2022-07-27 18:52:27,857 - INFO - downloader - no more download task for thread downloader-001
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-001 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-003
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-003 exit
2022-07-27 18:52:27,858 - INFO - downloader - no more download task for thread downloader-004
2022-07-27 18:52:27,858 - INFO - downloader - thread downloader-004 exit
2022-07-27 18:52:27,859 - INFO - downloader - no more download task for thread downloader-002
2022-07-27 18:52:27,859 - INFO - downloader - thread downloader-002 exit
2022-07-27 18:52:27,894 - INFO - icrawler.crawler - Crawling task done!

feay1234 commented 2 years ago

got the same problem.

dravicenna commented 2 years ago

The same problem

philborman commented 1 year ago

The problem is in builtin/google.py. Replace the parse function around line 148 with this:

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    images = soup.find_all(name='img')
    uris = []
    for img in images:
        if img.has_attr('src'):
            uris.append(img['src'])
    return [{'file_url': uri} for uri in uris]
Viachaslau85 commented 1 year ago

The problem is in builtin/google.py. Replace the parse function around line 148 with this:

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    images = soup.find_all(name='img')
    uris = []
    for img in images:
        if img.has_attr('src'):
            uris.append(img['src'])
    return [{'file_url': uri} for uri in uris]

Much better, but it still doesn't work. It generates errors of the following type:

2022-12-15 09:27:02,994 - ERROR - downloader - Exception caught when downloading file //www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg, error: '', remaining retry times: 2
2022-12-15 09:27:02,996 - ERROR - downloader - Exception caught when downloading file //www.gstatic.com/images/branding/googlelogo/svg/googlelogo_clr_160x56px.svg, error: '', remaining retry times: 1

jfreyberg commented 1 year ago

I can confirm the comments from @philborman and @Viachaslau85. This particular issue seems to be solved by using the provided code snippet (use my fork to pull the relevant changes: pip install git+git://github.com/jfreyberg/icrawler@master --upgrade), but it then runs into the new downloading error.

philborman commented 1 year ago

The downloading error is easy to fix. Just add this line before uris.append:

if 'images/branding/' not in img['src']:


jfreyberg commented 1 year ago

Thank you @philborman! I had to modify it some more for the thing to work:

if 'images/branding/' not in img['src']:
    img_src = img['src']
    if not img_src.startswith('http'):
        img_src = 'https:' + img_src
    uris.append(img_src)

Somehow my URLs lacked the protocol.

I did not properly test this (I can only confirm it worked for Google Images), so I cannot create a pull request, but if anyone wants to use my repo to fix this, feel free: https://github.com/jfreyberg/icrawler
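
For anyone who wants the whole thing in one place, here is a minimal, untested sketch that merges the snippets above (the img-src parse, the images/branding filter, and the protocol prefix) into a custom parser class. It assumes GoogleImageCrawler still accepts a parser_cls argument like the other builtin crawlers; PatchedGoogleParser is just a placeholder name:

from bs4 import BeautifulSoup
from icrawler import Parser
from icrawler.builtin import GoogleImageCrawler

class PatchedGoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(response.content.decode('utf-8', 'ignore'), 'lxml')
        uris = []
        for img in soup.find_all(name='img'):
            if not img.has_attr('src'):
                continue
            src = img['src']
            if 'images/branding/' in src:
                continue  # skip the Google logo assets that broke the downloader
            if not src.startswith('http'):
                src = 'https:' + src  # protocol-relative URLs need a scheme
            uris.append(src)
        return [{'file_url': uri} for uri in uris]

google_crawler = GoogleImageCrawler(parser_cls=PatchedGoogleParser,
                                    storage={'root_dir': 'test'})
google_crawler.crawl(keyword='cat', max_num=10)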

masa8 commented 1 year ago

Everything was fine until yesterday. I got the same problem today. I thought a few changes might get it to work, but they didn't.

At least the following code is working.

import requests
from bs4 import BeautifulSoup
import os

def save_images(save_dir, keywords):
  os.makedirs(save_dir, exist_ok=True)
  for keyword in keywords:
      url = f"https://www.google.com/search?q={keyword}&tbm=isch"
      res = requests.get(url)
      soup = BeautifulSoup(res.text, "html.parser")
      img_tags = soup.find_all("img")
      for i, img in enumerate(img_tags):
          try:
              img_url = img["src"]
              res = requests.get(img_url)
              with open(f"{save_dir}/{keyword}{str(i).zfill(5)}.jpg", "wb") as f:
                  f.write(res.content)
          except:
              continue

keywords = ["cat"]
save_dir = "train"
save_images(save_dir, keywords)

So I changed the parse method like this:

class GoogleParser(Parser):
    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        image_tags = soup.find_all("img")
        uris = []
        for img in image_tags:
            try:
                img_url = img["src"]
                res = requests.get(img_url)  # experiment only
                uris.append(img_url)
            except:
                continue
        print(len(uris))  # experiment: this prints 0
        return [{'file_url': uri} for uri in uris]

and then ran the following code:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'train'})
google_crawler.crawl(keyword='cat', max_num=1)

but I got nothing.

Environment

icrawler (0.6.6)
Pillow (8.4.0)
six (1.16.0)
lxml (4.9.2)
beautifulsoup4 (4.11.2)
requests (2.27.1)
soupsieve (2.4)
charset-normalizer (2.0.12)
idna (3.4)
certifi (2022.12.7)
urllib3 (1.26.15)

ZhiyuanChen commented 1 year ago

Could anyone let me know if this still persists on 0.6.7?

simonmcnair commented 1 year ago

It's still not working for me.

megayounus786 commented 1 year ago

Also same error!

Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/younus/.local/lib/python3.10/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2023-07-25 09:04:15,067 - INFO - downloader - no more download task for thread downloader-001
2023-07-25 09:04:15,069 - INFO - downloader - thread downloader-001 exit
2023-07-25 09:04:15,073 - INFO - icrawler.crawler - Crawling task done!

This broke recently... I tried different Python versions but still no progress. Please help fix this.

bretdavi commented 9 months ago

So I found the solution, at least for my case, and it has to do with this line: https://github.com/hellock/icrawler/blob/ad5633cab6e8960cda7bf797ad32ab960d1d6ef6/icrawler/builtin/google.py#L155

ds:0 and ds:1 are keys in the AF_initDataCallback data structure, and that check is supposed to match the ds:1 key and ignore blocks that have the ds:0 key.

The problem is that it's just a basic <sub_str> in <str> check, so if an unrelated ds:0 substring appears in that massive block of text, the block gets skipped and never parsed. Sure enough, I did some debugging, and there was an unrelated string in the block that matched, breaking the parse logic.

A proper solution would be a refined regex, or perhaps even parsing the actual data structure into a native dict. My lazy solution was to add single quotes around the keys as part of the substring check, which is much less likely to incorrectly match some random other string:

if "'ds:0'" in txt or "'ds:1'" not in txt

Again, this is not a robust solution, but a quick hack to get things working 😆

I just created a custom parser class that inherits from GoogleParser with that change added, and it at least worked for me.

So if you're looking for a quick fix, try that. I may try to get a more robust fix pushed as a PR, or someone else can take a crack at it.
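
To make the failure mode concrete, here is a tiny self-contained illustration of the difference between the two checks. The sample string is completely made up (real AF_initDataCallback blocks are huge); it just happens to contain the characters ds:0 inside an unrelated token:

def looks_like_image_payload_loose(txt):
    # original-style check: any stray "ds:0" anywhere makes us skip the block
    return not ("ds:0" in txt or "ds:1" not in txt)

def looks_like_image_payload_quoted(txt):
    # the quoted-key tweak: far less likely to collide with unrelated text
    return not ("'ds:0'" in txt or "'ds:1'" not in txt)

sample = "AF_initDataCallback({key: 'ds:1', data: [\"bounds:0\", ...]})"
print(looks_like_image_payload_loose(sample))   # False - block wrongly skipped
print(looks_like_image_payload_quoted(sample))  # True  - block correctly kept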

bretdavi commented 9 months ago

Well, I hit the issue again with other searches, so I guess that was just 'one' of the issues with it 😆

bretdavi commented 9 months ago

It seemed weird that the failures were sporadic for me, not always on the same download. So I did more debugging and noticed that the response data for the failed parses didn't have the div elements the parser expects. I also noticed some script content referencing XSRF, which didn't seem like a great sign.

I added some logging to the parser's worker_exec function to dump the request and response headers to a log file to see if that showed anything, and sure enough it did.
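
For anyone who wants to reproduce that kind of logging, here is a rough sketch of the idea, assuming response is the requests.Response object the parser gets handed; the function name and log path are just placeholders:

import json

def dump_headers(response, path="parser_headers.log"):
    # append the URL plus request/response headers for one parsed page
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"URL: {response.url}\n")
        f.write("request headers:  " + json.dumps(dict(response.request.headers)) + "\n")
        f.write("response headers: " + json.dumps(dict(response.headers)) + "\n\n")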

I noticed that successful requests have no cookie, but every failed request has a Cookie set in the request and response headers. I cleared all browser cookies for Google and it seemed to get further, but I guess the cookie gets regenerated at some point. The feeder/parser/downloader all share a common session, so maybe that has something to do with it.

I'm definitely not an expert on web requests/security stuff, but it seems that the cookie gets generated at some point and Google then flags the request as XSRF, or something like that, which is why the parser fails to extract the image URLs. I tried some tricks I found online to block the session from using cookies, but that didn't work.

Figured I'd post my findings though, as this seems likely to be the main culprit (though the other issue I reported is still valid).

bretdavi commented 9 months ago

So this isn't exactly a solution, but I decided to just switch and try out the BingImageCrawler, and it's been working perfectly fine.

I think it boils down to Google more actively trying to block scraping, which leads to the issues I mentioned. Bing doesn't seem to care 😆 The GoogleImageCrawler probably needs a more sophisticated implementation to avoid getting blocked.

Imho, I don't see any real difference between using Google or Bing, so I'd recommend swapping over and giving Bing a shot. Again, not a fix, but I'd avoid the GoogleImageCrawler for now.
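
For reference, the swap is essentially a one-line change from the example at the top of this issue, since the builtin crawlers share the same crawl() interface:

from icrawler.builtin import BingImageCrawler

bing_crawler = BingImageCrawler(storage={'root_dir': 'test'})
bing_crawler.crawl(keyword='cat', max_num=10)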

Patty-OFurniture commented 8 months ago

There is a log line right above the error, "INFO - parser - parsing result page": this is the URL where the error happened. Copy this URL into a browser and verify that it works. It also looks like people have been examining the HTML but not saving or posting it, which makes this almost impossible to diagnose. Using a random offset makes it definitely impossible.

You could hack up parser.py:

self.logger.info(f"parsing result page {url}") self.logger.debug(response.content)

or:

task_list = self.parse(response, **kwargs)
if not task_list:
    self.logger.debug("self.parse() returned no tasks")
    with open("task_list_error.log", 'ab') as f:
        f.write(response.content)
        f.write(b"\n")

That's not a solution, but it would help find the core problem. I hit this error while hacking on the search filters: I sent in garbage and got no results. Logging the URL and the actual page, as above, will help.

Sorry about any formatting, the code blocks seem inconsistent.

axy1976 commented 5 months ago

I am facing the same issue. Is there any stable solution to this problem?

2024-04-08 10:48:52,455 - INFO - icrawler.crawler - start crawling...
2024-04-08 10:48:52,455 - INFO - icrawler.crawler - starting 1 feeder threads...
2024-04-08 10:48:52,455 - INFO - feeder - thread feeder-001 exit
2024-04-08 10:48:52,455 - INFO - icrawler.crawler - starting 2 parser threads...
2024-04-08 10:48:52,456 - INFO - icrawler.crawler - starting 4 downloader threads...
2024-04-08 10:48:54,083 - INFO - parser - parsing result page https://www.google.com/search?q=Computer+Networks+Advanced&ijn=0&start=0&tbs=sur%3Afmc&tbm=isch
Exception in thread parser-001:
Traceback (most recent call last):
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/axyzcodes/.pyenv/versions/3.9.18/lib/python3.9/site-packages/icrawler/parser.py", line 94, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
2024-04-08 10:48:54,456 - INFO - parser - no more page urls for thread parser-002 to parse
2024-04-08 10:48:54,456 - INFO - parser - thread parser-002 exit
2024-04-08 10:48:57,462 - INFO - downloader - no more download task for thread downloader-001
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-002
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-003
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-003 exit
2024-04-08 10:48:57,463 - INFO - downloader - no more download task for thread downloader-004
2024-04-08 10:48:57,464 - INFO - downloader - thread downloader-004 exit
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-002 exit
2024-04-08 10:48:57,463 - INFO - downloader - thread downloader-001 exit
2024-04-08 10:48:57,478 - INFO - icrawler.crawler - Crawling task done!

It worked fine for 4 months and then this problem suddenly occurred.

Patty-OFurniture commented 5 months ago

First, "stable" will never happen because the search providers can change their results pages at any time.

Second, when I hit this error, I created a log file of the actual results. I just had to retry Google and it worked the second time. But the log didn't show anything interesting to fix, as far as I've had time to investigate.

And finally, Google just last week changed their results page to assume JavaScript is enabled, and the results are not in the same format. It looks like they are actively fighting against projects like this one, but I haven't had time to really dig in. The interesting parts seem to be "encrypted" (their word for the property) somehow. There is a noscript tag with a redirect, but I haven't figured out a good way to insert that back into the queue. And the current logic expects script tags for each image, so it would also have to be updated for the noscript page results.

So all scriptless crawlers are broken until fixed.

axy1976 commented 5 months ago

So a stable fix isn't really possible.

Thank you for the reply. Appreciate your time.

Patty-OFurniture commented 5 months ago

To be clear, the same problem exists for all crawlers. The results could change at any time.

I use image downloader sometimes; there's a fork in my repos with some fixes. I think it still works with Google, but it uses the Chrome driver to actually run a browser for Google results, so I prefer icrawler most of the time.


ZhiyuanChen commented 4 months ago

Please let me know if 0.6.8 fixes this issue~