Strange behavior on URLs with query strings

TylerHendrickson commented 11 years ago

Hello,

First of all, I would like to say "Thanks!" for making Pynliner. I started using it for converting email templates with embedded <style> tags to inline styles, since many email clients like to strip out embedded CSS. It's been very helpful overall.

However, one issue I've discovered is that visible URLs that contain query strings show up looking a bit odd. I've been able to reproduce the issue as follows:

>>> import pynliner
# The original string: a link whose text is a URL with a query string
>>> my_str = '<a href="http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl">http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl</a>'
# Convert with pynliner...
>>> pynliner.fromString(my_str)
u'<a href="http://www.example.com?utm_campaign=abcd&amp;utm_medium=efgh&amp;utm_source=ijkl">http://www.example.com?utm_campaign=abcd&utm;_medium=efgh&utm;_source=ijkl</a>'

As you can see, the ampersands in the href attribute get encoded as &amp;.

Additionally, if you look at the URL wrapped within the <a> tag, the last two underscores are prefixed with a semicolon, e.g. utm_medium becomes utm;_medium.

Strangely enough, AFAICT, the issue with the underscores only occurs within a query string. For example, if you change all of the instances of example.com in the above Python code to ex_ample.com, the underscores come out just as they went in.

Is this a known issue? Is there any kind of workaround available? Or am I crazy and missing something obvious...?

Thanks!

TylerHendrickson commented 11 years ago

Sorry, forgot to mention I'm experiencing this issue using the following (all PyPI distributions):

pynliner 0.4.0
BeautifulSoup 3.2.1
cssutils 0.9.10

rennat commented 11 years ago

Hmm... I have not seen that before. I'll take a look tonight.

I answered without looking too closely earlier. This happens inside BeautifulSoup and is the correct behavior for XML. "Naked" ampersands are not allowed in XML, and BeautifulSoup replaces them with escaped ampersands (&amp;). They will still work correctly when clicked in a browser.

See this SO question regarding the semicolon underscore business, which seems to be a "feature/bug" of BeautifulSoup:

http://stackoverflow.com/questions/7187744/beautifulsoup-parser-appends-semicolons-to-naked-ampersands-mangling-urls

TylerHendrickson commented 11 years ago

I see the "feature/bug" you're talking about and checked out that link (as well as a few others). I see what's going on with the &amp; business with naked ampersands, which makes sense (kinda; I agree with some of the comments that this is way under-documented by BeautifulSoup).

However, I didn't see anything in reference to the issue with underscores, e.g. utm_medium becomes utm;_medium in the URL that appears within the <a> </a> tags (not within the href). Is the cause of this issue the same as that of the ampersands? Are "naked" underscores also not allowed in XML?

Even supposing that is the case, consider the following (from my original post):

Original: utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl
Result: utm_campaign=abcd&utm;_medium=efgh&utm;_source=ijkl

You can see that in the result, only the second and third underscores are affected by semicolons; the first one is fine.

TylerHendrickson commented 11 years ago

Hey, here's a thought...

I'm really hoping I'm wrong on this one, but could the issue be that either pynliner or BeautifulSoup is incorrectly assuming that &utm; is some kind of HTML entity? This seems unlikely given the strict adherence to the XML standard, but when I re-read my last post, that possibility jumped out at me.

I don't boast to know every possible HTML entity offhand, so I checked here and here. According to these sources, &utm; is definitely not an entity.

Edit: Also checked at Wikipedia's List of XML and HTML character entity references. Not there either.
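For anyone who wants to repeat that check without reading through the lists by hand, here is a rough sketch using the entity table that ships with Python 3's standard library. The helper name is_known_entity is made up for illustration, and this only tells you what Python's own table contains, not what BeautifulSoup does internally.

```python
from html.entities import html5

def is_known_entity(name):
    """Return True if &name; is a named character reference in Python's HTML5 table.

    Hypothetical helper for illustration; keys in html5 look like 'amp;'.
    """
    return (name + ";") in html5

amp_known = is_known_entity("amp")   # the entity behind &amp;
utm_known = is_known_entity("utm")   # the suspicious &utm; from the output above
print("amp:", amp_known, "utm:", utm_known)
```

If utm is not in the table, that is consistent with the parser simply treating anything shaped like &name as an entity reference and appending the semicolon, rather than recognizing a real entity.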

rennat commented 11 years ago

The problem lies in BeautifulSoup, and the correct workaround is to escape your ampersands before running the markup through pynliner (which loads it into BeautifulSoup, modifies it, then converts it back to text).
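A minimal sketch of that pre-escaping step, assuming a regex pass over the raw markup is acceptable for your templates. The name escape_naked_ampersands and the pattern are illustrative only and are not part of pynliner or BeautifulSoup. The pattern leaves alone anything that already looks like an entity reference (&amp;, &#38;, &#x26;), so running it twice does not double-escape.

```python
import re

# An "&" that does NOT already start something shaped like an entity
# reference (named, decimal, or hexadecimal) ending in ";".
NAKED_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_naked_ampersands(markup):
    """Replace bare ampersands with &amp; (hypothetical helper)."""
    return NAKED_AMPERSAND.sub("&amp;", markup)

url = "http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl"
link = '<a href="{0}">{0}</a>'.format(url)
escaped = escape_naked_ampersands(link)
print(escaped)
# escaped can then be handed to pynliner.fromString() as in the original report.
```

One caveat: text such as a&b;c would be treated as already escaped because &b; is shaped like an entity, so this is a heuristic rather than a full HTML-aware escaper.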

rennat / pynliner

Strange behavior on URLs with query strings #21