jsumners / feedparser

Automatically exported from code.google.com/p/feedparser

Not parsing <sec> tags within <content:encoded> tags #349

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run feedparser on XML feed 
2. Entry contains:
     <content:encoded>
          <sec>
               <title>Something</title>
               <p>Article text here</p>
          </sec>
          <sec>
               <title>Related Title</title>
               <p>Copy of the second entry</p>
               <p>Additional text</p>
          </sec>
     </content:encoded>

3. Feedparser returns 'content' as ' ' 

What is the expected output? What do you see instead?
I would expect the full content to be returned in 'content'. Instead, feedparser 
skips the <sec> tags and returns an empty string.
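
The steps above can be sketched as a small script. This is only a minimal reproduction attempt, not the reporter's actual code: the feed wrapper (`<rss>`, `<channel>`, the entry title) and the helper name `entry_content` are assumptions added for illustration, and the result may differ between feedparser versions.

```python
# Minimal feed built around the <content:encoded> fragment from the report.
# The surrounding <rss>/<channel>/<item> wrapper is an assumption.
FEED_XML = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example entry</title>
      <content:encoded>
        <sec>
          <title>Something</title>
          <p>Article text here</p>
        </sec>
        <sec>
          <title>Related Title</title>
          <p>Copy of the second entry</p>
          <p>Additional text</p>
        </sec>
      </content:encoded>
    </item>
  </channel>
</rss>"""


def entry_content(feed_xml):
    """Return the 'content' values feedparser reports for the first entry."""
    # feedparser is a third-party package; it is imported here so the
    # sample feed above can be inspected without it installed.
    import feedparser

    parsed = feedparser.parse(feed_xml)
    entry = parsed.entries[0]
    # 'content' is a list of dict-like objects, each with a 'value' key.
    return [item.value for item in entry.get("content", [])]


if __name__ == "__main__":
    print(entry_content(FEED_XML))
```

If the issue reproduces, the printed values should be empty or whitespace only rather than the article markup.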

What version of the product are you using? On what operating system?
Python 2.7.2 on Mac OS X 10.7.3 with feedparser 5.1.1

Please provide any additional information below.

Original issue reported on code.google.com by jason....@willowtreeapps.com on 27 Apr 2012 at 2:40

GoogleCodeExporter commented 9 years ago
Can you provide a URL to a feed demonstrating this issue, or would you create 
an attachment to this bug report? At first blush this just looks like 
feedparser's HTML sanitizer at work. You can disable it using code like:

    feedparser.SANITIZE_HTML = 0

Let me know if that works!
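
Because `SANITIZE_HTML` is a module-level flag, setting it affects every later parse in the same process. One possible way to limit that is a small context manager that restores the previous value afterwards. This is a sketch only: `sanitizer_disabled` is a hypothetical helper, not part of feedparser, and it assumes the module exposes `SANITIZE_HTML` as shown above.

```python
from contextlib import contextmanager


@contextmanager
def sanitizer_disabled(module):
    """Temporarily set module.SANITIZE_HTML to 0, then restore it.

    'module' is expected to be the imported feedparser module; any object
    with a SANITIZE_HTML attribute works, which keeps the helper testable.
    """
    previous = module.SANITIZE_HTML
    module.SANITIZE_HTML = 0
    try:
        yield module
    finally:
        # Restore the old setting even if parsing raises.
        module.SANITIZE_HTML = previous


# Usage (requires the third-party feedparser package):
#
#     import feedparser
#     with sanitizer_disabled(feedparser):
#         parsed = feedparser.parse("feed.xml")
```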

Original comment by kurtmckee on 28 Apr 2012 at 9:03

GoogleCodeExporter commented 9 years ago
I tried the HTML sanitizer setting as well, and although it did capture the 
first <sec> tag, the parsed data was empty. 
My workaround was to use lxml and strip out the <sec> tags for the ['content'] 
keys. 
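
The workaround code itself was not posted. As a rough illustration of the idea, the sketch below removes the `<sec>` wrappers from the raw feed text before it is handed to a parser. It uses a standard-library regular expression instead of lxml, so it is a swapped-in technique rather than the approach described above, and `strip_sec_tags` is a hypothetical name.

```python
import re

# Matches <sec>, </sec>, <sec/> and <sec attr="...">, but not <section>.
_SEC_TAG = re.compile(r"</?sec\b[^>]*>")


def strip_sec_tags(xml_text):
    """Remove <sec> opening and closing tags, keeping everything inside."""
    return _SEC_TAG.sub("", xml_text)


if __name__ == "__main__":
    sample = (
        "<content:encoded>"
        "<sec><title>Something</title><p>Article text here</p></sec>"
        "</content:encoded>"
    )
    print(strip_sec_tags(sample))
```

A regular expression is fragile compared with a real XML parser (it would also touch `<sec>` text inside comments or CDATA), so this is best treated as a quick pre-processing step for feeds with a known shape.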

Sample entry is below:

<item>
<title>Lost Economic Time: The Proust Index</title>
<link>http://www.example.com</link>
<dc:contributor>The summary was prepared by Mark A. Harrison</dc:contributor>
<category>May Articles</category>
<original-article>Economist (25 February 2012)</original-article>
<description>
<p>The Digest May 2012.</p>
<p>
The <em>Economist</em> has constructed a measure to determine how much economic 
progress has been undone by the financial crisis. The measure considers seven 
indicators of economic health: GDP, stock markets, consumption, wages, house 
prices, wealth, and unemployment. The results indicate that economic progress, 
or time, that has been lost will not be easily regained for most countries.
</p>
</description>
<content:encoded>
<sec>
<title>What’s Inside?</title>
<p>
According to the<em>Economist</em>’s measure of lost time, Greece’s
economic clock has been turned back 12 years. Ireland, Italy, Portugal, and 
Spain have lost seven years, and Britain has lost eight. The United States has 
lost 10 years.
</p>
<p>
The measure uses three broad categories: indicators of household wealth, 
consisting of financial asset prices and property prices; indicators of annual 
output and private consumption; and indicators of real wages and unemployment. 
The averages of time lost in each category are then added together for the 
overall measure.
</p>
<p>
Stock markets, as forward-looking indicators of expected returns, have 
historically recovered quickly. Wages seem not to have fallen back as far as 
other areas. House prices, with a few exceptions, remain at levels similar to 
those of a decade or more ago. Nominal GDP is burdened by debts set at high 
values during the boom that are not being made more manageable by growth and 
inflation.
</p>
<p>
Real GDP per person, a better indicator of consumer economic health, suggests 
one-third of the countries the IMF has data for are poorer than they were in 
2007, each losing five years or more. In the EU, 22 of its 27 members have lost 
time based on real GDP per person. Of the G–7 countries, only Germany has not 
taken a step back. The reduction in unemployment that many advanced economies 
had made before the crisis has been undone, with unemployment now reaching 
levels that countries have not seen for 10–20 years.
</p>
</sec>
<sec>
<title>How Is This Article Useful to Practitioners?</title>
<p>
For investors trying to make sense of such a measure to guide their allocation 
of investment capital, the coincidence of so many extraneous drivers 
influencing the data, which econometricians might call exogenous, as well as 
endogenous influences within each indicator probably undermine its usefulness, 
except perhaps as a footnote to the <em>Economist’</em>s own “Misery 
Index.” Any time series can also suffer from statistical problems, such as 
serial correlation or the selection of a misrepresentative time period.
</p>
</sec>
<sec>
<title>Abstractor’s Viewpoint</title>
<p>
In those lost years, commodities were mined, technologies advanced, patents 
lodged, infrastructure and buildings—including millions of new 
homes—improved, and human capital developed. Rather than a decade lost, the 
clock has more likely been reset back to where it was before bank lending 
spiraled unsustainably upward. Perhaps more valuable to investors would be 
knowledge of defective financial regulations, fiscal and monetary arrangements, 
and investment instruments and techniques that were acquired in those lost 
years to help prevent any repetition.
</p>
</sec>
</content:encoded>
<pubDate>Thu, 01 Apr 2012 08:00:00 +0000</pubDate>
<guid>http://www.example.com</guid>
</item>

Original comment by jason....@willowtreeapps.com on 2 May 2012 at 2:24

GoogleCodeExporter commented 9 years ago
Please provide a URL to the feed that's using this structure. I'd like to have 
an example of this in the wild to make sure that nothing's being lost when it's 
copied-and-pasted into this bug report. Thanks!

Original comment by kurtmckee on 10 May 2012 at 4:40

GoogleCodeExporter commented 9 years ago
My apologies for getting back so late on this. 

Here's the live feed: http://www.cfai.mobi/Mobile/Digest_RSS.xml

A good example is the entry titled "Lost Economic Time: The Proust Index".

-- Thanks!

Original comment by jason....@willowtreeapps.com on 14 May 2012 at 3:58

GoogleCodeExporter commented 9 years ago
Thanks! I've whittled the file down to a single entry that demonstrates the 
issue.

Original comment by kurtmckee on 18 May 2012 at 3:56

Attachments: