hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License
857 stars, 174 forks

Exception: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) #76

Closed: shubham0204 closed this issue 4 years ago

shubham0204 commented 4 years ago

I have been using icrawler to scrape some images from Google Search, with this code:

from icrawler.builtin import GoogleImageCrawler

keyword = 'elon_musk'
num_images = 200
output_directory = 'dataset/images'

google_crawler = GoogleImageCrawler(storage={'root_dir': output_directory})
google_crawler.crawl(keyword=keyword, max_num=num_images, filters=None)

Execution ends with this exception:

Exception in thread parser-001:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "/usr/local/lib/python3.6/dist-packages/icrawler/builtin/google.py", line 157, in parse
    meta = json.loads(txt)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I am running icrawler in Google Colab, so Python 3.6.9, from the Google Chrome browser.
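
For reference, this particular message is what `json.loads` reports when the string it receives is empty or does not begin with a JSON value at all. That suggests the text the parser pulled out of the results page was not the JSON it expected. A minimal sketch reproducing the message without icrawler follows; the sample strings are made up for illustration and are not what the crawler actually received.

```python
import json

# Hypothetical inputs: neither is what icrawler actually received,
# they only show which kinds of strings trigger this message.
samples = ["", "AF_initDataCallback({key: 'ds:1'});"]

messages = []
for text in samples:
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # str(err) is the same text shown on the last line of the traceback.
        messages.append(str(err))

for message in messages:
    print(message)
```

Both inputs fail at the very first character, which is why the position in the message is always line 1, column 1.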

gcheron commented 4 years ago

I have proposed a fix in PR #74

quhb2455 commented 4 years ago

> I have proposed a fix in PR #74

Hi gcheron, thank you very much for trying to fix this! I checked your fix and copied it into google.py, but I still get the same problem.

`

def parse(self, response):
    soup = BeautifulSoup(
        response.content.decode('utf-8', 'ignore'), 'lxml')
    image_divs = soup.find_all('script')
    for div in image_divs:
        txt = div.string
        if txt is None or not txt.startswith('AF_initDataCallback'):
            continue
        if 'ds:1' not in txt:
            continue
        txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:(.+)}\);?$",
                     "\\2", txt, 0, re.DOTALL)

        meta = json.loads(txt)
        data = meta[31][0][12][2]

        uris = [img[1][3][0] for img in data if img[0] == 1]
        return [{'file_url': uri} for uri in uris]

`

Exception in thread parser-002:
Traceback (most recent call last):
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\site-packages\icrawler\parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\site-packages\icrawler\builtin\google.py", line 157, in parse
    meta = json.loads(txt)
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Rentalhub\anaconda3\envs\joong\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
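
One possible explanation, offered as a guess rather than a diagnosis: `re.sub` returns its input unchanged when the pattern does not match, so if the script layout on the page differs from what the regex expects, `json.loads` is handed the raw `AF_initDataCallback(...)` text and fails at character 0. The sketch below uses `re.search` so that a non-match can be detected and skipped instead. The helper name `extract_data_payload` and the sample strings are hypothetical and not part of icrawler.

```python
import json
import re

# Same pattern as in the parse() snippet quoted in this comment, compiled once.
PATTERN = re.compile(
    r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:(.+)}\);?$", re.DOTALL)


def extract_data_payload(script_text):
    """Return the parsed `data:` payload, or None if it cannot be extracted.

    Hypothetical helper for illustration; not part of icrawler.
    """
    match = PATTERN.search(script_text)
    if match is None:
        # re.sub would have returned script_text unchanged in this case.
        return None
    try:
        return json.loads(match.group(2))
    except json.JSONDecodeError:
        # The pattern matched, but the captured text is not JSON.
        return None


# Made-up script bodies, much smaller than a real results page.
matching = "AF_initDataCallback({key: 'ds:1', isError: false, data:[1, [2, 3]]});"
changed = "AF_initDataCallback({key: 'ds:1', data:function(){return [1]}});"

print(extract_data_payload(matching))
print(extract_data_payload(changed))
# Demonstrates the pass-through behaviour of re.sub on a non-match.
print(re.sub(PATTERN, r"\2", "no match here") == "no match here")
```

A caller could then `continue` to the next script tag whenever the helper returns None, rather than letting the parser thread die on the exception.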

xbreid commented 4 years ago

> I have proposed a fix in PR #74

Can confirm this PR fixes the issue for me.

quhb2455 commented 4 years ago

@bar0191 Really? It still doesn't work for me. Are you on Windows or Linux?

tienthegainz commented 4 years ago

> I have proposed a fix in PR #74

This fixed it for me on Ubuntu 18.

akihiro-inui commented 4 years ago

> I have proposed a fix in PR #74

This fix also worked for me. I am on macOS and ran the tests in a virtual environment I created.

ZhiyuanChen commented 4 years ago

Resolved in #84