Closed GoogleCodeExporter closed 9 years ago
I am also using Ubuntu, if that makes a difference.
Original comment by dcart...@gmail.com
on 7 Jul 2010 at 12:02
By the way, the link highsecurityscan.com/?affid=336&subid;=landing, IP
address: 91.212.127.19, ASN: 49087, Description: fake scanner page is a bad
site. DON'T GO TO IT! I am doing a research project on malware.
Original comment by dcart...@gmail.com
on 7 Jul 2010 at 3:01
Can you produce a safe and minimal feed that reproduces the problem? Ideally it
should have just one entry.
Original comment by adewale
on 21 Jul 2010 at 6:26
Can you also try to reproduce the problem with the latest version of Feedparser
from Subversion?
Original comment by adewale
on 21 Jul 2010 at 6:27
Hi adewale,
Thank you for working on this problem. I really appreciate it. So I
downloaded the latest source from svn. Unfortunately, it didn't work. :( So
this is what I did. I edited the previous xml file and removed all the extra
items. Then I edited the description and the domain to a host that I made up.
I also created a test Python file (test.py).
In the xml file, it looks like...
<description>Host: removesemicolon.com/?affid=336&subid=landing, Description:
please remove the "amp;" =</description>
After I run test.py, it looks like...
Host: removesemicolon.com/?affid=336&amp;subid=landing, Description: please remove
the "amp;"
I was expecting it to look like...
Host: removesemicolon.com/?affid=336&subid=landing, Description: please remove
the "amp;"
Thank you again,
David
Original comment by dcart...@gmail.com
on 21 Jul 2010 at 3:48
This is odd. Originally the bug report was that feedparser was adding a
spurious semi-colon. However this does not happen with the current version of
the codebase. The summary element looks like this:
'summary': u'Host: highsecurityscan.com/?affid=336&subid=landing, IP address:
91.212.127.19, ASN: 49087, Description: fake scanner page',
However your testcase seems to be asking for feedparser to xmlunescape the
contents of the description. In other words you're asking for "&amp;" to be
converted to "&". In that case you're better off following the advice of:
http://stackoverflow.com/questions/2360598/how-do-i-unescape-html-entities-in-a-string-in-python-3-1
and using xml.sax.saxutils.unescape:
http://docs.python.org/library/xml.sax.utils.html
Am I correct?
Original comment by adewale
on 31 Jul 2010 at 8:08
Please close this bug.
Tested using svn trunk and both attached sample feeds. Original report (&subid
being turned into an invalid entity reference) is fixed.
@dcart185: the conversion you're seeing now is expected behavior. Adewale is
correct, you can import unescape() from xml.sax.saxutils and use that to
resolve the issue:
from xml.sax.saxutils import unescape
fixed_description = unescape(description)
Original comment by kurtmckee
on 5 Dec 2010 at 6:45
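For anyone landing on this thread later, here is a minimal sketch of the unescape() suggestion above. The `description` string is a made-up stand-in for the value feedparser returns (it reuses the placeholder host from the test feed), and the `&quot;` mapping is only an illustration of the optional `entities` argument, not something the original report needed.

```python
from xml.sax.saxutils import unescape

# Hypothetical stand-in for the description/summary value taken from a parsed entry.
description = "Host: removesemicolon.com/?affid=336&amp;subid=landing"

# unescape() converts &amp;, &lt; and &gt; back to &, < and >.
fixed_description = unescape(description)
print(fixed_description)

# Other entities are only handled if you pass them in explicitly.
quoted = unescape("&quot;fake scanner page&quot;", {"&quot;": '"'})
print(quoted)
```

Note that unescape() only covers those three entities by default, so anything else has to be supplied through the second argument.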
Thank you. I was very new to Python at the time. I will definitely use it. Thank
you again.
Original comment by dcart...@gmail.com
on 5 Dec 2010 at 8:59
How do you close this?
Original comment by dcart...@gmail.com
on 5 Dec 2010 at 9:00
I'm glad that was helpful. If you run into any more issues with feedparser,
please don't hesitate to file another report!
Sorry, I was addressing adewale in that first line. I think only the maintainer
can close the bug. I have to apologize because as I've been triaging bugs I've
fallen into a terse style that has confounded several other people, too, so I
obviously have to work to improve that! :)
Original comment by kurtmckee
on 5 Dec 2010 at 9:09
Original comment by adewale
on 12 Dec 2010 at 11:19
Original issue reported on code.google.com by
dcart...@gmail.com
on 7 Jul 2010 at 12:01