Closed TylerHendrickson closed 11 years ago
Sorry, forgot to mention I'm experiencing this issue using the following (all PyPI distributions):
Hmm... I have not seen that before. I'll take a look tonight.
I answered without looking too closely earlier. This happens inside BeautifulSoup and is the correct behavior for XML. "Naked" ampersands are not allowed in XML, and BeautifulSoup replaces them with escaped ampersands (&amp;). They will still work correctly when clicked in a browser.
See this SO question regarding the semicolon underscore business, which seems to be a "feature/bug" of BeautifulSoup:
I see the "feature/bug" you're talking about and checked out that link (as well as a few others). I see what's going on with the &amp; business with naked ampersands, which makes sense (kinda; I agree with some of the comments that this is way under-documented by BeautifulSoup).
However, I didn't see anything in reference to the issue with underscores, e.g. utm_medium becomes utm;_medium in the URL that appears within the <a> </a> tags (not within the href). Is the cause of this issue the same as that of the ampersands? Are "naked" underscores also not allowed in XML?
Even supposing that is the case, consider the following (from my original post):
Original: utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl
Result: utm_campaign=abcd&utm;_medium=efgh&utm;_source=ijkl
You can see that in the result, only the second and third underscores are preceded by semicolons; the first one is fine...
Hey, here's a thought...
I'm really hoping I'm wrong on this one, but could the issue be that either pynliner or BeautifulSoup is incorrectly assuming that &utm; is some kind of HTML entity? This seems unlikely given the strict adherence to the XML standard, but when I re-read my last post, that possibility jumped out at me.
I don't claim to know every possible HTML entity offhand, so I checked here and here. According to these sources, &utm; is definitely not an entity.
Edit: Also checked at Wikipedia's List of XML and HTML character entity references. Not there either.
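For anyone who wants to repeat that check locally instead of reading entity lists by hand, here is a minimal sketch using the html.entities table from Python 3's standard library. It only confirms that utm is not a named entity; it says nothing about what BeautifulSoup does internally. The helper name is_named_entity is made up for illustration.

```python
from html.entities import html5


def is_named_entity(name):
    """Return True if &name; is a named character reference in the HTML5 table."""
    # Keys in html5 are stored without the leading '&', e.g. 'amp;' or 'gt;'.
    return (name + ";") in html5


print(is_named_entity("amp"))  # a genuine entity
print(is_named_entity("utm"))  # the suspicious one from the result above
```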
The problem lies in BeautifulSoup, and the correct workaround is to escape your ampersands before running the HTML through pynliner (which loads it into BeautifulSoup, modifies it, then converts it back to text).
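A rough sketch of that workaround, assuming the HTML is available as a plain string before it reaches pynliner. The helper escape_bare_ampersands and its regular expression are hypothetical, not part of pynliner or BeautifulSoup. It escapes only ampersands that do not already begin a well-formed reference, so existing &amp; or &#38; are left alone.

```python
import re

# Matches '&' that does NOT start a well-formed reference such as
# '&amp;', '&#38;' or '&#x26;'. (Hypothetical helper, not a pynliner API.)
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html):
    """Replace every bare '&' in html with '&amp;'."""
    return _BARE_AMP.sub("&amp;", html)


query = "utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl"
escaped = escape_bare_ampersands(query)
print(escaped)
# The escaped markup would then be passed to pynliner in place of the original.
```

One caveat with this approach: text that happens to look like a reference (an ampersand, a name, then a semicolon) is left untouched, so it is a best-effort filter rather than a full HTML-aware escape.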
Hello,
First of all, I would like to say "Thanks!" for making Pynliner. I started using it for converting email templates with embedded <style> tags to inline styles, since many email clients like to strip out embedded CSS. It's been very helpful overall.
However, one issue I've discovered is that visible URLs that contain query strings show up looking a bit odd. I've been able to reproduce the issue as follows:
As you can see, the ampersands in the href param get encoded into &amp;.
Additionally, if you look at the URL wrapped within the <a> tag, the last two underscores are prefixed with a semicolon, e.g. utm_medium becomes utm;_medium.
Strangely enough, AFAICT, the issue with the underscores only occurs within a query string. For example, if you change all of the instances of example.com in the above python code to ex_ample.com, the underscores come out just as they went in.
Is this a known issue? Is there any kind of workaround available? Or am I crazy and missing something obvious...?
Thanks!