ngtignacio / java-wikipedia-parser

Automatically exported from code.google.com/p/java-wikipedia-parser

Missing text in parsed output. #9

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Parse a page whose list items contain several links. Some of the text is
missing from the parsed output. For example, the text in the HTML shown below
(from the page for the year 1979) is not extracted correctly. The parser appears
to mishandle certain anchor tags (possibly a problem with a regular expression).

What is the expected output? What do you see instead?

For this HTML:

<li><a href="/wiki/May_27" title="May 27">May 27</a> – <a 
href="/wiki/1979_Indianapolis_500" title="1979 Indianapolis 500">Indianapolis 
500</a>: <a href="/wiki/Rick_Mears" title="Rick Mears">Rick Mears</a> wins the 
race for the first time, and car owner <a href="/wiki/Roger_Penske" 
title="Roger Penske">Roger Penske</a> for the second time.</li>

The extracted text is: *   wins the race for the first time, and car owner 
Roger Penske for the second time.
Instead of: * May 27 – Indianapolis 500: Rick Mears wins the race for the 
first time, and car owner Roger Penske for the second time.

And for this HTML:
...
<li>The <a href="/wiki/United_States" title="United States">United States</a> 
and the <a href="/wiki/People%27s_Republic_of_China" title="People's Republic 
of China">People's Republic of China</a> establish full <a 
href="/wiki/Sino-American_relations" title="Sino-American relations">diplomatic 
relations</a>.</li>
...

The extracted text is: diplomatic relations.
Instead of: * The United States and the People's Republic of China establish 
full diplomatic relations.
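
For a regression test, the expected behaviour can be sketched roughly as below. This is not the parser's actual code: the class `AnchorTextSketch` and the method `extractText` are hypothetical names made up for illustration, and the sketch assumes that dropping every tag and collapsing whitespace is enough for this snippet (no entity decoding, no nested lists). The line breaks inside tags and attribute values are kept in the input on purpose.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of java-wikipedia-parser.
class AnchorTextSketch {
    // Opening <li> tag, with or without attributes (does not match <link>).
    private static final Pattern LIST_ITEM = Pattern.compile("(?i)<li(\\s[^>]*)?>");
    // Any tag; [^>] also matches newlines, so tags wrapped across lines are covered.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        // Turn each list item into a "* " bullet, as in the expected output.
        String text = LIST_ITEM.matcher(html).replaceAll("* ");
        // Drop the remaining tags but keep the text between them.
        text = TAG.matcher(text).replaceAll("");
        // Collapse line breaks and runs of spaces into single spaces.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html =
            "<li>The <a href=\"/wiki/United_States\" title=\"United States\">United States</a> \n"
            + "and the <a href=\"/wiki/People%27s_Republic_of_China\" title=\"People's Republic \n"
            + "of China\">People's Republic of China</a> establish full <a \n"
            + "href=\"/wiki/Sino-American_relations\" title=\"Sino-American relations\">diplomatic \n"
            + "relations</a>.</li>";
        System.out.println(extractText(html));
    }
}
```

Running it prints `* The United States and the People's Republic of China establish full diplomatic relations.`, which is the output expected from the parser for this list item.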

Cheers

Original issue reported on code.google.com by andre.bi...@gmail.com on 25 May 2011 at 2:21