scrapy / scrapely

A pure-python HTML screen-scraping library

Incorrect cleaning of <img> tag #92

Closed victormartinez closed 7 years ago

victormartinez commented 8 years ago

Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely library. After some trials, I found a bug that I believe is critical.

The img tag is expected to appear in self-closing form (<img src='github.png' />), but it may also appear without the trailing slash: <img src='stackoverflow.png'>. In that case, safehtml cleans the text incorrectly and drops everything after the tag. For example, see this terminal session:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my'

IMHO, the expected output is 'my img is <strong>cool</strong>'. The same behavior occurs with the <input> tag.

Best regards,

ruairif commented 8 years ago

Thanks for pointing this out. This is probably because safehtml has no way of dealing with self-closing tags (img, input, meta, etc.). We will need to look into a suitable way to handle them.
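
For illustration only, here is a minimal sketch of the kind of handling discussed above. It is not scrapely's code and does not use safehtml; it relies on the standard library's html.parser, and the names SketchCleaner, clean, VOID_TAGS, ALLOWED_TAGS and REPLACEMENTS are hypothetical. The idea is that a tag with no closing counterpart is treated as complete on its own, so dropping it never swallows the text that follows.

```python
from html import escape
from html.parser import HTMLParser

# Void elements never take a closing tag (partial list, for illustration).
VOID_TAGS = {"img", "input", "meta", "br", "hr", "link"}
# Hypothetical whitelist and rename table, NOT scrapely's real configuration.
ALLOWED_TAGS = {"strong", "em", "p", "br"}
REPLACEMENTS = {"b": "strong", "i": "em"}


class SketchCleaner(HTMLParser):
    """Keep allowed tags, rename some, drop the rest but keep their text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        tag = REPLACEMENTS.get(tag, tag)
        if tag in ALLOWED_TAGS:
            # Attributes are discarded in this sketch.
            self.parts.append("<%s>" % tag)
        # A disallowed tag such as <img> is simply skipped. Nothing waits
        # for a matching </img>, so the text after it is still collected.

    def handle_endtag(self, tag):
        tag = REPLACEMENTS.get(tag, tag)
        # Void elements get no closing tag in the output.
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Re-escape text so decoded entities cannot turn into markup.
        self.parts.append(escape(data, quote=False))


def clean(html):
    parser = SketchCleaner()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

With this sketch, clean('my <img href="http://fake.url"> img is <b>cool</b>') keeps the text after the img tag, and the self-closing form <img src='github.png' /> is handled the same way.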

victormartinez commented 7 years ago

@ruairif First, sorry for replying so late (it's been a crazy few days). Second, thanks for creating a PR and solving this. Although I did not reply on your PR, I checked out the code and liked the way you approached the problem.

Thanks a lot!

DavidPinho commented 7 years ago

Hi guys, I'm currently facing the same problem pointed out by @victormartinez! Given that the bug was fixed in commit c2878f117fdd6f7ecce44f9846dc54b6cfceb48f, do you know when a new release will be published?

Thanks in advance!!

robsonpeixoto commented 7 years ago

I tested here with and without the closing tag and it worked very well:

>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is </img><b>cool</b>')
'my  img is <strong>cool</strong>'
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my  img is <strong>cool</strong>'