Open GoogleCodeExporter opened 9 years ago
Can you provide a URL to a feed demonstrating this issue, or would you create
an attachment to this bug report? At first blush this just looks like
feedparser's HTML sanitizer at work. You can disable it using code like:
feedparser.SANITIZE_HTML = 0
Let me know if that works!
Original comment by kurtmckee
on 28 Apr 2012 at 9:03
I tried the HTML sanitizer setting as well, and although it did capture the
first <sec> tag, the parsed data was empty.
My workaround was to use lxml and strip out the <sec> tags for the ['content']
keys.
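A minimal sketch of that kind of workaround, for anyone hitting the same problem. It is not the code I actually used: it swaps lxml for a plain regular expression from the standard library so it runs without extra dependencies, and the names strip_sec_tags and SEC_TAG are made up for illustration. It only removes the <sec> wrappers and leaves everything inside them untouched.

```python
import re

# Matches opening and closing <sec> tags (with optional attributes) only.
# Longer tag names that merely start with "sec", e.g. <section>, do not match.
SEC_TAG = re.compile(r"</?sec(?:\s[^>]*)?>", re.IGNORECASE)


def strip_sec_tags(markup):
    """Remove <sec> wrappers from markup but keep their inner content."""
    return SEC_TAG.sub("", markup)


# Shortened stand-in for the <content:encoded> body of the sample entry.
sample = "<sec><title>What's Inside?</title><p>Body text.</p></sec>"
cleaned = strip_sec_tags(sample)
print(cleaned)
```

The same function could be applied to the raw feed text before handing it to feedparser, or to each content value afterwards, depending on where the data goes missing.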
Sample entry is below:
<item>
<title>Lost Economic Time: The Proust Index</title>
<link>http://www.example.com</link>
<dc:contributor>The summary was prepared by Mark A. Harrison</dc:contributor>
<category>May Articles</category>
<original-article>Economist (25 February 2012)</original-article>
<description>
<p>The Digest May 2012.</p>
<p>
The <em>Economist</em> has constructed a measure to determine how much economic
progress has been undone by the financial crisis. The measure considers seven
indicators of economic health: GDP, stock markets, consumption, wages, house
prices, wealth, and unemployment. The results indicate that economic progress,
or time, that has been lost will not be easily regained for most countries.
</p>
</description>
<content:encoded>
<sec>
<title>What’s Inside?</title>
<p>
According to the<em>Economist</em>’s measure of lost time, Greece’s
economic clock has been turned back 12 years. Ireland, Italy, Portugal, and
Spain have lost seven years, and Britain has lost eight. The United States has
lost 10 years.
</p>
<p>
The measure uses three broad categories: indicators of household wealth,
consisting of financial asset prices and property prices; indicators of annual
output and private consumption; and indicators of real wages and unemployment.
The averages of time lost in each category are then added together for the
overall measure.
</p>
<p>
Stock markets, as forward-looking indicators of expected returns, have
historically recovered quickly. Wages seem not to have fallen back as far as
other areas. House prices, with a few exceptions, remain at levels similar to
those of a decade or more ago. Nominal GDP is burdened by debts set at high
values during the boom that are not being made more manageable by growth and
inflation.
</p>
<p>
Real GDP per person, a better indicator of consumer economic health, suggests
one-third of the countries the IMF has data for are poorer than they were in
2007, each losing five years or more. In the EU, 22 of its 27 members have lost
time based on real GDP per person. Of the G–7 countries, only Germany has not
taken a step back. The reduction in unemployment that many advanced economies
had made before the crisis has been undone, with unemployment now reaching
levels that countries have not seen for 10–20 years.
</p>
</sec>
<sec>
<title>How Is This Article Useful to Practitioners?</title>
<p>
For investors trying to make sense of such a measure to guide their allocation
of investment capital, the coincidence of so many extraneous drivers
influencing the data, which econometricians might call exogenous, as well as
endogenous influences within each indicator probably undermine its usefulness,
except perhaps as a footnote to the <em>Economist’</em>s own “Misery
Index.” Any time series can also suffer from statistical problems, such as
serial correlation or the selection of a misrepresentative time period.
</p>
</sec>
<sec>
<title>Abstractor’s Viewpoint</title>
<p>
In those lost years, commodities were mined, technologies advanced, patents
lodged, infrastructure and buildings—including millions of new
homes—improved, and human capital developed. Rather than a decade lost, the
clock has more likely been reset back to where it was before bank lending
spiraled unsustainably upward. Perhaps more valuable to investors would be
knowledge of defective financial regulations, fiscal and monetary arrangements,
and investment instruments and techniques that were acquired in those lost
years to help prevent any repetition.
</p>
</sec>
</content:encoded>
<pubDate>Thu, 01 Apr 2012 08:00:00 +0000</pubDate>
<guid>http://www.example.com</guid>
</item>
Original comment by jason....@willowtreeapps.com
on 2 May 2012 at 2:24
Please provide a URL to the feed that's using this structure. I'd like to have
an example of this in the wild to make sure that nothing's being lost when it's
copied-and-pasted into this bug report. Thanks!
Original comment by kurtmckee
on 10 May 2012 at 4:40
My apologies for getting back to you so late on this.
Here's the live feed: http://www.cfai.mobi/Mobile/Digest_RSS.xml
A good example would be the entry with the title "Lost Economic Time: The Proust Index".
-- Thanks!
Original comment by jason....@willowtreeapps.com
on 14 May 2012 at 3:58
Thanks! I've whittled the file down to a single entry that demonstrates the
issue.
Original comment by kurtmckee
on 18 May 2012 at 3:56
Original issue reported on code.google.com by
jason....@willowtreeapps.com
on 27 Apr 2012 at 2:40