Closed victormartinez closed 7 years ago
Thanks for pointing this out. This is probably because safehtml doesn't have a way of dealing with self-closing tags (img, input, meta, etc.). We will need to look into a suitable way to handle them.
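To illustrate the idea, here is a minimal sketch of one way a cleaner can cope with tags that never get a closing tag. It is not Scrapely's implementation and does not use its API: it relies only on Python's standard html.parser, and the names VOID_ELEMENTS, ALLOWED, SketchCleaner and clean, as well as the tag whitelist itself, are assumptions made up for this example. The point is that void elements are never pushed onto the stack of open tags, so a missing (or stray) closing tag cannot affect the text that follows.

```python
import html
import re
from html.parser import HTMLParser

# Void elements in HTML: they have no content and no closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical whitelist for this sketch: source tag -> output tag.
ALLOWED = {"b": "strong", "strong": "strong", "i": "em", "em": "em",
           "p": "p", "br": "br"}


class SketchCleaner(HTMLParser):
    """Keeps whitelisted tags, drops all others, keeps the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # whitelisted non-void tags still open

    def handle_starttag(self, tag, attrs):
        name = ALLOWED.get(tag)
        if name is None:
            return  # dropped tag (img, input, ...): nothing to track
        self.out.append("<%s>" % name)
        if name not in VOID_ELEMENTS:
            self.open_tags.append(name)  # void tags are never pushed

    def handle_startendtag(self, tag, attrs):
        # <br/> and <br> are treated the same way.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        name = ALLOWED.get(tag)
        # Only close what is actually open; stray </img> is ignored.
        if self.open_tags and self.open_tags[-1] == name:
            self.out.append("</%s>" % self.open_tags.pop())

    def handle_data(self, data):
        # Character references were decoded, so escape the text again.
        self.out.append(html.escape(data, quote=False))


def clean(markup):
    parser = SketchCleaner()
    parser.feed(markup)
    parser.close()
    # Close whitelisted tags the input left open.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    # Collapse runs of whitespace left behind by dropped tags.
    return re.sub(r"\s+", " ", "".join(parser.out)).strip()
```

With this sketch, clean('my <img href="http://fake.url"> img is <b>cool</b>') gives the same result whether or not a closing </img> is present, because the img tag is dropped without the cleaner ever waiting for its end.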
@ruairif First, sorry for taking so long to reply (it's been a crazy few days). Second, thanks for creating a PR and solving that. Although I didn't reply to your PR, I checked out the code and liked the way you approached the problem.
Thanks a lot!
Hi guys, I'm currently facing the same problem pointed out by @victormartinez! Given that the bug was fixed in commit c2878f117fdd6f7ecce44f9846dc54b6cfceb48f, do you know when a new release will be published?
Thanks in advance!!
I tested here with and without the closing tag, and it worked very well:
>>> from scrapely.extractors import safehtml, htmlregion
>>> t = lambda s: safehtml(htmlregion(s))
>>> t('my <img href="http://fake.url"> img is </img><b>cool</b>')
'my img is <strong>cool</strong>'
>>> t('my <img href="http://fake.url"> img is <b>cool</b>')
'my img is <strong>cool</strong>'
Hi guys, I was looking for an HTML cleaner and found one inside the Scrapely lib. After some trials, I found a bug that I believe is critical.
It is expected that the img tag appears in the self-closing form (<img src='github.png' />), but it might appear in this form: <img src='stackoverflow.png'>. In this case, safehtml cleans the text incorrectly. For example, see the test in the terminal:
IMHO, the output was expected to be my img is <strong>cool</strong>. The same behavior is observed with the <input> tag.
Best regards,