hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License
855 stars, 174 forks

Fix Google Image parser #102

Closed (taoyudong closed this 3 years ago)

taoyudong commented 3 years ago

The original Google Image parser fails to extract the correct image URLs, so the downloads fail with 404 errors.

The original regex used to parse the image URL is "http.*?.(?:jpg|png|bmp)". An example match is "https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcSgI8sYlvf_yqwiRKl4fm7Pzej5Sehs-yVEQG-cHKmFKVp2YSDdImymVCH-T_zXJdCJwnY\u0026usqp\u003dCAU",159,318],["https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png"

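The match starts at an unintended URL (a thumbnail link with no image extension) that precedes the correct one, and the lazy wildcard runs through it until it reaches the extension of the real image URL. An easy fix is to exclude "[" from the substring between "http" and the extension, so that a match cannot span these two URLs. Square brackets rarely appear literally in URLs and are normally percent-encoded ("[" becomes %5B), so this fix should work in general.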
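To illustrate the difference, here is a minimal sketch that runs both patterns on the example text quoted above. The names `sample`, `ORIGINAL` and `FIXED` are hypothetical, the dot before the extension is assumed to be meant literally (escaped), and `[^\[]` is one way to express "exclude "["", not necessarily the exact pattern used in this PR.

```python
import re
from urllib.parse import quote

# Text copied from the example match above (closing quote added).
sample = (
    r'"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcSgI8sYlvf_yqwiRKl4fm7Pzej5Sehs-yVEQG-cHKmFKVp2YSDdImymVCH-T_zXJdCJwnY\u0026usqp\u003dCAU",159,318],'
    r'["https://icatcare.org/app/uploads/2018/07/Thinking-of-getting-a-cat.png"'
)

# Assumed form of the original pattern: the lazy wildcard may cross "[".
ORIGINAL = re.compile(r"http.*?\.(?:jpg|png|bmp)")

# Assumed form of the fix: the wildcard may not consume "[",
# so a match cannot start in one array entry and end in the next.
FIXED = re.compile(r"http[^\[]*?\.(?:jpg|png|bmp)")

for name, pattern in (("original", ORIGINAL), ("fixed", FIXED)):
    for url in pattern.findall(sample):
        print(name, "->", url)

# A literal "[" in a URL is normally percent-encoded, which is why
# excluding it should not reject valid image URLs.
print(quote("["))
```

With the original pattern the single match begins at the gstatic thumbnail URL and ends at the ".png" of the icatcare URL, while the fixed pattern returns only the icatcare URL.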

taoyudong commented 3 years ago

Related to #95

ZhiyuanChen commented 3 years ago

Thank you!