Closed cristianocca closed 5 years ago
For me it is unclear what the problem is here. "Start to fail" is not enough information.
What exactly do you mean by "some entries are not being parsed properly and everything after that ends incomplete"? What are the parsing errors? What is incomplete or missing?
Sorry if the report was not clear, I was hoping the samples would be enough to highlight what's going on.
Basically, some character or other input in the source string (which I couldn't find) is causing the whole parse to stop early, yielding incomplete results. There are no parsing errors; the parsing just "stops".
As you can see in the examples above, the last line of the parsed dict is incomplete (there's no description/summary). In fact, the actual parsing stops there, and the following feed entries are not included either.
In order to reproduce the issue, you probably need to take the text from above (remove the ) since the original feed has changed and the issue no longer happens. This looks like it was a very specific case that was making the parser go nuts and is no longer happening.
First of all, please reformat your JSON and XML code blocks and wrap the lines. They are unreadable, and because of that it is still unclear what your problem is.
Can't seem to wrap and use code blocks at the same time on the editor. But really, the problem is simple, the two lines with ** ** are the conflicting ones.
Keep in mind that we try to help you in our free and family time. I have just formatted it, but wrapped would be easier. Or upload the XML file as an attachment, whatever saves our resources.
First, the parsed result from:
<title>Cynet is offering unhappy competitors' customers a refund for the time remaining on existing contracts</title>
Cynet goes head-to-head with CrowdStrike, DarkTrace, Cylance, Carbon Black & Symantec, offering their unhappy customers a refund for the time remaining on their existing contracts. Cynet, the automated threat discovery and mitigation platform was built to address the advanced threats that AV and Fi...
http://feedproxy.google.com/~r/TheHackersNews/~3/2kBjOTNiTks/cynet-endpoint-security.html
tag:blogger.com,1999:blog-4802841478634147276.post-799735885797893608
Tue, 12 Mar 2019 09:12:54 -0400
info@thehackernews.com (Exclusive Deals)
Is parsed into:
{'title': "Cynet is offering unhappy competitors' customers a refund for the time remaining on existing contracts", 'title_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': "Cynet is offering unhappy competitors' customers a refund for the time remaining on existing contracts"}}
Which is clearly incomplete. Furthermore, any following item after the broken one is not parsed (doesn't make it to the parsed dict).
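For anyone wondering how a parser can end up with a half-filled entry and nothing after it, here is a minimal sketch using Python's standard `xml.sax` module rather than feedparser itself. The three-item feed, the `TitleCollector` handler, and `parse_until_error` are all made up for illustration, and the unescaped `&` in the second item is only an assumed trigger; the thread never identifies the offending character in the real feed.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records every <title> it sees."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._in_title = False


# Made-up feed: the second item's description holds a bare '&',
# which is not well-formed XML (it should be '&amp;').
SAMPLE = b"""<rss><channel>
<item><title>First</title><description>fine</description></item>
<item><title>Second</title><description>A & B</description></item>
<item><title>Third</title><description>never reached</description></item>
</channel></rss>"""


def parse_until_error(data):
    """Return (titles collected so far, SAXParseException or None)."""
    handler = TitleCollector()
    error = None
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        # Everything delivered before the bad token stays in the handler.
        error = exc
    return handler.titles, error


titles, error = parse_until_error(SAMPLE)
```

In this sketch the handler has already received the first two titles when the parser gives up, so the broken item keeps its title but nothing after it, and the third item never shows up.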
You wrote there are no errors? You know FeedParserDict.bozo_exception?
There is a parsing error.
>>> a.bozo_exception
SAXParseException('not well-formed (invalid token)',)
>>> type(a.bozo_exception)
<class 'xml.sax._exceptions.SAXParseException'>
So you have to check for bozo_exceptions.
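A minimal sketch of such a check follows. `describe_parse_problem` is a hypothetical helper, not part of feedparser; it only assumes that the parsed result is a dict-like object with a truthy `bozo` flag and a `bozo_exception` entry when the feed was not well-formed.

```python
def describe_parse_problem(parsed):
    """Return a short description of a recorded parse error, or None.

    `parsed` is assumed to behave like the result of feedparser.parse():
    a mapping whose 'bozo' flag is truthy and whose 'bozo_exception'
    holds the exception when the feed was not well-formed.
    """
    if not parsed.get("bozo"):
        return None
    exc = parsed.get("bozo_exception")
    return "{}: {}".format(type(exc).__name__, exc)
```

It would be called as `describe_parse_problem(feedparser.parse(url))`, and a non-None result means the entries may be incomplete.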
The issue can be closed.
Interesting. I really didn't see that the parsed results would also contain an exception (instead of throwing it). I guess that's why I never saw it. I'm sorry if I wasted your time.
It is not "wasted": you learned something.
@cristianocca Back in the old, old days of feedparser, the original author made the decision to never throw exceptions. However, nowadays this breaks people's expectations.
It may be worthwhile to revisit this decision, as it affects people (like yourself) who would expect an exception to be raised if there was a dire problem. Thanks for reporting this!
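Until then, callers who prefer an exception can add a thin wrapper of their own. The sketch below is one possible shape, not feedparser behavior: `FeedNotWellFormed` and `ensure_well_formed` are hypothetical names, and the code assumes only the `bozo` and `bozo_exception` keys shown earlier in this thread.

```python
class FeedNotWellFormed(Exception):
    """Hypothetical exception type; not part of feedparser."""


def ensure_well_formed(parsed):
    """Return `parsed` unchanged, or raise if a parse error was recorded.

    `parsed` is assumed to behave like the result of feedparser.parse().
    """
    if parsed.get("bozo"):
        cause = parsed.get("bozo_exception")
        # Chain the recorded exception so the original error stays visible.
        if isinstance(cause, BaseException):
            raise FeedNotWellFormed(str(cause)) from cause
        raise FeedNotWellFormed("feed flagged as bozo without an exception")
    return parsed
```

As far as I know, the bozo flag can also be set for less severe problems than a truncated parse, so raising on every bozo result is a deliberate choice and may be stricter than some callers want.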
Using Python 3 (3.6) and feedparser 5.2.1 under ubuntu.
I'm trying to parse a feed that for some reason started to fail recently. After digging a bit, it turns out the failures happen because some entries are not parsed properly and everything after them ends up incomplete (i.e., the resulting dict is missing link and other attributes).
The feed I'm parsing is the following:
data = feedparser.parse('https://thn.li/rss.php')
Since the feed might change from the time you read this, below are the raw text responses, and the parsed dicts.
-- Parsed dict (from data['entries']) --
**This one is incomplete**
--- Raw response ---