feedparser adds a semi colon to the description when nothing should be added

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

import feedparser

t = feedparser.parse("http://www.malwaredomainlist.com/hostslist/mdl.xml")
for item in t.entries:
    print(item.description)

What is the expected output? What do you see instead?

Host: highsecurityscan.com/?affid=336&subid;=landing, IP address: 
91.212.127.19, ASN: 49087, Description: fake scanner page

Where it says "&subid;=landing", a semicolon has been added that is not on the original page.

It should look like this:
Host: highsecurityscan.com/?affid=336&subid=landing

What version of the product are you using? On what operating system?

4.1-14

Please provide any additional information below.
When I downloaded the text version (which is included in the issue), it looks like
<description>Host: highsecurityscan.com/?affid=336&amp;subid=landing
It seems as if the semicolon is moved to after "subid" when the feed is parsed by 
feedparser.

Original issue reported on code.google.com by dcart...@gmail.com on 7 Jul 2010 at 12:01

Attachments:

mdl.xml

GoogleCodeExporter commented 9 years ago

I am also using Ubuntu, if that makes a difference.

Original comment by dcart...@gmail.com on 7 Jul 2010 at 12:02

GoogleCodeExporter commented 9 years ago

By the way, the link highsecurityscan.com/?affid=336&subid;=landing (IP 
address: 91.212.127.19, ASN: 49087, Description: fake scanner page) points to a bad 
site.  DON'T GO TO IT! I am doing a research project on malware.

Original comment by dcart...@gmail.com on 7 Jul 2010 at 3:01

GoogleCodeExporter commented 9 years ago

Can you produce a safe and minimal feed that reproduces the problem? Ideally it 
should have just one entry.

Original comment by adewale on 21 Jul 2010 at 6:26

GoogleCodeExporter commented 9 years ago

Can you also try to reproduce the problem with the latest version of feedparser 
from Subversion?

Original comment by adewale on 21 Jul 2010 at 6:27

GoogleCodeExporter commented 9 years ago

Hi adewale,

Thank you for working on this problem.  I really appreciate it.  I 
downloaded the latest source from svn.  Unfortunately, it didn't work.  :(  
This is what I did: I edited the previous XML file and removed all the extra 
items.  Then I changed the description and the domain to a host that I made up.  
I also created a test Python file (test.py).

In the xml file, it looks like...

<description>Host: removesemicolon.com/?affid=336&subid=landing, Description: 
please remove the "amp;" =</description>

After I run test.py, it looks like...

Host: removesemicolon.com/?affid=336&amp;subid=landing, Description: please remove 
the "amp;"

I was expecting it to look like...

Host: removesemicolon.com/?affid=336&subid=landing, Description: please remove 
the "amp;"

Thank you again,

David

Original comment by dcart...@gmail.com on 21 Jul 2010 at 3:48

Attachments:

GoogleCodeExporter commented 9 years ago

This is odd. Originally the bug report was that feedparser was adding a 
spurious semicolon. However, this does not happen with the current version of 
the codebase. The summary element looks like this:
'summary': u'Host: highsecurityscan.com/?affid=336&subid=landing, IP address: 
91.212.127.19, ASN: 49087, Description: fake scanner page',

However, your test case seems to be asking feedparser to XML-unescape the 
contents of the description. In other words, you're asking for "&amp;" to be 
converted to "&". In that case you're better off following the advice of 
http://stackoverflow.com/questions/2360598/how-do-i-unescape-html-entities-in-a-string-in-python-3-1 
and using xml.sax.saxutils.unescape: 
http://docs.python.org/library/xml.sax.utils.html
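
For illustration only, here is a minimal sketch of what unescaping does to the made-up test description from this thread. It assumes Python 3.4 or later, where both xml.sax.saxutils.unescape and html.unescape are in the standard library. The variable names are hypothetical, and the escaped string is an assumption about what the parsed description contains.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import unescape as xml_unescape

# Assumed description value after parsing, still entity-escaped
# (uses the made-up host from this thread).
escaped = "Host: removesemicolon.com/?affid=336&amp;subid=landing"

# saxutils.unescape resolves only &amp;, &lt; and &gt; by default.
xml_result = xml_unescape(escaped)

# html.unescape also resolves named HTML entities and numeric references.
html_result = html_unescape(escaped)

print(xml_result)
print(html_result)
```

For this particular string the two calls should give the same result, since "&amp;" is the only entity present.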

Am I correct?

Original comment by adewale on 31 Jul 2010 at 8:08

GoogleCodeExporter commented 9 years ago

Please close this bug.

Tested using svn trunk and both attached sample feeds. Original report (&subid 
being turned into an invalid entity reference) is fixed.

@dcart185: the conversion you're seeing now is expected behavior. Adewale is 
correct: you can import unescape() from xml.sax.saxutils and use that to 
resolve the issue:

from xml.sax.saxutils import unescape
fixed_description = unescape(description)
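
If the description also contains entities other than &amp;, &lt; and &gt;, unescape() accepts an optional dictionary of extra replacements. The sketch below is only an illustration: the helper name clean_description and the sample string are hypothetical, not part of feedparser.

```python
from xml.sax.saxutils import unescape

def clean_description(description):
    """Unescape &amp;, &lt;, &gt; plus the extra entities listed below."""
    # Extra entities are passed explicitly; unescape() does not handle them by default.
    extra = {"&quot;": '"', "&apos;": "'"}
    return unescape(description, extra)

# Hypothetical sample, loosely based on the test feed in this thread.
sample = "affid=336&amp;subid=landing, Description: &quot;fake&quot; page"
print(clean_description(sample))
```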

Original comment by kurtmckee on 5 Dec 2010 at 6:45

GoogleCodeExporter commented 9 years ago

Thank you. I was very new to Python at the time. I will definitely use it.  Thank 
you again.

Original comment by dcart...@gmail.com on 5 Dec 2010 at 8:59

GoogleCodeExporter commented 9 years ago

How do you close this?

Original comment by dcart...@gmail.com on 5 Dec 2010 at 9:00

GoogleCodeExporter commented 9 years ago

I'm glad that was helpful. If you run into any more issues with feedparser, 
please don't hesitate to file another report!

Sorry, I was addressing adewale in that first line. I think only the maintainer 
can close the bug. I have to apologize because as I've been triaging bugs I've 
fallen into a terse style that has confounded several other people, too, so I 
obviously have to work to improve that! :)

Original comment by kurtmckee on 5 Dec 2010 at 9:09

GoogleCodeExporter commented 9 years ago

Original comment by adewale on 12 Dec 2010 at 11:19

Changed state: Fixed

pombreda / feedparser

feedparser adds a semi colon to the description when nothing should be added #223