grangier / python-goose

Html Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.98k stars, 787 forks

Non-obvious failure grabbing top_image #240

Open Slater-Victoroff opened 9 years ago

Slater-Victoroff commented 9 years ago

I'm running goose on a few URLs, and it returns an empty string for top_image on this one in particular: http://shop.nordstrom.com/s/lancer-skincare-sheer-fluid-sun-shield-spf-30/3565107

What confuses me is that it seems like a pretty typical URL, and the page is almost identical to a number of pages that are handled correctly. Could someone point me to the relevant part of the codebase so I can put up a PR?

Slater-Victoroff commented 9 years ago

Looking into this a little more deeply, it looks like an issue with the parser: even though the main image is in an img tag, the parser can't pick it up. I'm going to see if I can get a version working with raw requests + lxml, then try to integrate the findings back in.

Slater-Victoroff commented 9 years ago

Looked into this a bit, and also found that the way images are handled now is... kind of silly. It shouldn't download the whole image just to get its dimensions. For a format like PNG, the first 24 bytes contain all of the metadata needed to make a decision (see: http://stackoverflow.com/questions/7460218/get-image-size-without-downloading-it-in-python)

Will try to make my way over to fix that once I get to the bottom of this first issue.
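To make the header-only idea concrete, here is a minimal sketch for PNG files only. In a PNG, the width and height sit in the IHDR chunk, which ends at byte 24 of the file, so a ranged request for those bytes is enough. The function names `png_dimensions` and `fetch_png_dimensions` are hypothetical and not part of goose, and the sketch assumes the server honours the `Range` header; JPEG and GIF store their dimensions differently and would need their own parsers.

```python
import struct
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header):
    """Return (width, height) parsed from the first 24 bytes of a PNG.

    Returns None if the bytes are too short or are not a PNG header.
    """
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def fetch_png_dimensions(url, timeout=10):
    """Hypothetical helper: request only the first 24 bytes of the image."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-23"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # read(24) caps the download even if the server ignores Range.
        return png_dimensions(response.read(24))
```

Keeping the parsing separate from the network call means the size check can be exercised on raw bytes without touching a server.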

Also, it looks like something is buggy in the parser implementation: requests + lxml works just fine.
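A quick way to check whether a page's img tags are discoverable at all, independent of goose, is to list every `img` source in the raw HTML. The sketch below is a standard-library stand-in (`html.parser` and `urllib.parse`) rather than the requests + lxml version mentioned above, and the names `ImgCollector` and `image_sources` are hypothetical. It assumes the HTML has already been fetched as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collect absolute src URLs from every <img> tag in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed through this method.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))


def image_sources(html, base_url):
    """Return the list of image URLs found in html, resolved against base_url."""
    collector = ImgCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.sources
```

If this returns the main product image for the failing URL while goose reports nothing, that would point at the extraction step rather than at the page's markup.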