seanjensengrey / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Encoding problem? – Strange garbage introduced #2

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
import java.net.URL;

import de.l3s.boilerpipe.extractors.DefaultExtractor;

public class Oneliner
{
  public static void main(final String[] args) throws Exception
  {
    // Article page whose extracted text shows the garbled characters.
    final URL url = new URL("http://a2zmacau.com/1284/ao-man-long-tells-macau-court-he-did-receive-bribes/");

    // This can also be done in one line:
    System.out.println(DefaultExtractor.INSTANCE.getText(url));
  }
}

gives

The former secretary for transport and public works, Ao Man Long who took a cool US$100 million in bribes and now is serving a 27-year jail sentence for serious corruption charges, admitted yesterday to having received money from companies including Seng Meng Fai.
Ao was a witness in his family’s trial and rejected claims that his relatives and wife had knowledge of what the former secretary was doing. Ao also told the court his family did what he asked without ever questioning him or the activities involving grand sums of money and offshore accounts.
The court repeatedly heard how the former secretary’s family trusted Ao and his decisions.
However, Ao confessed to receiving large sums of money, but said it had not been in the way described in the indictment against him.
The payments were made in increments for services provided to those companies, however they did not affect the outcome or the process of the public tenders and winning bidders, the court heard.
The court also heard that Ho Meng Fai had made payments to bank accounts under Ao’s family members’ names, but were managed by the former secretary. The money was not related to bribery nor was it related to corruption, Ao told the court.
The money was “simplyâ€� for services Ecoline, one of Ao’s shell companies, had carried out, the former secretary said, adding that for the Macau Dome, Ho Meng Fai had sought services from Ecoline to contact a projects concession company from the mainland.
The court heard that this was an example of the types of services Ecoline carried out.
Ao also said that this time, unlike previously, he was telling the truth. But he was unable to itemise all the works where such services and payments were made, saying that the prosecution would have to ask the deceased Lee Se Chong, who had all the companies’ contacts.
The court also heard that Ao had only had access to Ecoline in 2006 after the  manager Lee Se Chong died.
Related Websites
Leave a reply
Search For Macau Hotels

Please notice "after the  manager". The HTML of this part is very simple,

 <p>... Ecoline in 2006 after the  manager Lee Se Chong died.</p>

but contains two consecutive spaces.

Hope this helps to improve your tool, which looks quite good.

Kaspar

Original issue reported on code.google.com by kaspar.f...@gmail.com on 7 Jan 2010 at 6:50

GoogleCodeExporter commented 9 years ago
(There also seems to be a problem with quotes, see "“simply�".)

Original comment by kaspar.f...@gmail.com on 7 Jan 2010 at 7:02

GoogleCodeExporter commented 9 years ago
Hi Kaspar,

thanks for this report.

The "strange garbage" is actually what you get when incorrectly setting the input encoding. In this case, the text was UTF-8, but it was treated as Latin-1.
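
For readers who want to reproduce the effect, here is a minimal sketch that uses only the Java standard library. It assumes the wrong charset was windows-1252, a superset of Latin-1's printable range, since the "€" and "™" in the output above only exist there. The class and method names are made up for illustration and are not part of boilerpipe.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Undo the damage: turn the garbled characters back into the original
    // bytes and decode them as UTF-8. Only works if no byte was lost.
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the page bytes were read as windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");

        // U+2019 (right single quotation mark) is E2 80 99 in UTF-8.
        String original = "family\u2019s";

        // garbled is "familyâ€™s" (U+00E2, U+20AC, U+2122 in place of U+2019)
        String garbled = misdecode(original, cp1252);
        System.out.println(garbled);

        // Round-trips back to the original string.
        System.out.println(repair(garbled, cp1252));
    }
}
```

The closing quote in "“simplyâ€�" is probably the same effect: its last UTF-8 byte, 0x9D, has no assigned character in windows-1252, so it likely ends up as the replacement character, and that one cannot be repaired after the fact.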

When calling Extractor.getText(URL), we relied upon NekoHTML to find <META HTTP-EQUIV="Content-Type"> tags even when passing a Reader instead of an InputStream. Unfortunately that didn't work...
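
One way to avoid depending on the parser's META detection is to pick the charset before any Reader is created. The following is only a sketch of that idea, not the change that went into SVN: it reads the charset from an HTTP Content-Type header value (for example, what URLConnection.getContentType() returns) and falls back to ISO-8859-1 when none is given. The class name, method names, and the fallback are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=UTF-8" or charset="utf-8".
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback when
    // the header is missing, has no charset, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Decodes a response body using the charset from the header value.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }

    // Wraps the stream in a Reader that already knows the right charset.
    static Reader open(InputStream in, String contentType) {
        return new InputStreamReader(in, charsetOf(contentType, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "family\u2019s".getBytes(StandardCharsets.UTF_8);
        try (Reader r = open(new ByteArrayInputStream(body), "text/html; charset=UTF-8")) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            System.out.println(sb);
        }
    }
}
```

A Reader built this way could then be handed to the extractor in place of the URL, so the parser no longer has to guess the encoding.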

I have fixed it in SVN. Could you please check out ExtractorBase from trunk and see if it works for you?

Best,
Christian

Original comment by ckkohl79 on 24 Jan 2010 at 3:43

GoogleCodeExporter commented 9 years ago
Hi Christian,

Thanks for fixing this. Works like a charm.

Best,
Kaspar

Original comment by kaspar.f...@gmail.com on 24 Jan 2010 at 3:53

GoogleCodeExporter commented 9 years ago

Original comment by ckkohl79 on 24 Jan 2010 at 4:10