janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

Incomplete extraction of text with special characters #69

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello Boilerpipe,

When ArticleExtractor.INSTANCE.getText(url) is called for a web page that 
contains source code (as in the example below), the function does not return 
the whole text.

The expected text [1] is the one extracted by the web version of boilerpipe.

The result is the same with versions 1.1.0 and 1.2.0. How can I get the 
library to extract the complete text, as the web boilerpipe does?

Steps to reproduce the problem:
1. url = http://supplesoftware.wordpress.com/2009/07/01/make-sure-you-get-all-your-messages-in-your-scala-code/
2. ArticleExtractor.INSTANCE.getText(url)
3. Returned incomplete text:
"Make sure you get all your messages in your Scala code 01Jul09 I had this 
funny little Scala actor related bug today. Imagine you want to process a few 
things in parallel. So you go: val processor = self jobs.foreach { job =>   
actor {     processor ! (job.id, job.run)   } } // Merge results for (i <- 1 to 
jobs.size) {   self.receiveWithin(1000) {     case (jobId:Int, 
result:JobResult) => mergeResult(result)   } } All good. Then you realise that 
one or more of the jobs may fail with an exception. Which you have to handle 
somehow. So, you think, you’ll break on the first exception and report back. 
So you change that to: val processor = self jobs.foreach { job =>   actor {    
try {      processor  !  (job.id, job.run)    } catch {      case ex:Throwable 
=> processor  !  (job.id, ex, job)    } } } // Merge results for (i <- 1 to 
jobs.size) {   self.receiveWithin(1000) {     case (jobId:Int, 
result:JobResult) => mergeResult(result)     case(jobId:Int, ex:Throwable, 
job:Job) => throw new RuntimeException("Job " + jobId + " failed", ex)   } } 
Cool. You fail on the first one – you just blow up and report back to your 
caller that something went wrong. Great! Well, not so…. Because the other 
jobs you started are still going to send you their results. You’ve stopped 
that thread by throwing an exception, so there’s nothing to receive those 
messages. When the web-server (in this case) reuses that thread, it will be 
sent all those messages. It won’t actually receive them until it hits the 
receiveWithin methond, so when you are expecting the return from the freshly 
started actors, you will actually be getting the messages from the actors that 
broke while servicing the last request. Kind of undesirable really. One thing 
you can do is wait for all the messages. Take a response back regardless of 
what it is. This is what I did. Here’s the example: for (rowNumber <- 1 to 
mainRowsCount) { self.receiveWithin(2000) { "

[1] http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fsupplesoftware.wordpress.com%2F2009%2F07%2F01%2Fmake-sure-you-get-all-your-messages-in-your-scala-code%2F&extractor=ArticleExtractor&output=text&extractImages=
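
The reference URL in [1] appears to be the page URL percent-encoded into the `url` query parameter of the web extractor. Below is a minimal sketch of building such a URL for other pages. It assumes the endpoint and parameter names work the way the URL in [1] suggests. `BoilerpipeWebUrl` and `build` are hypothetical names and are not part of boilerpipe.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration only; not part of the boilerpipe library.
final class BoilerpipeWebUrl {
    // Endpoint and query parameter names are copied from reference [1];
    // their meaning is an assumption, not documented behaviour.
    private static final String ENDPOINT = "http://boilerpipe-web.appspot.com/extract";

    // Percent-encodes the page URL and places it in the "url" query parameter.
    static String build(String pageUrl, String extractor) {
        try {
            String encoded = URLEncoder.encode(pageUrl, "UTF-8");
            return ENDPOINT + "?url=" + encoded
                    + "&extractor=" + extractor
                    + "&output=text&extractImages=";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "http://supplesoftware.wordpress.com/2009/07/01/"
                + "make-sure-you-get-all-your-messages-in-your-scala-code/";
        System.out.println(build(page, "ArticleExtractor"));
    }
}
```

Run against the page from step 1, this should reproduce the URL given in [1], which makes it easier to compare the web output with the library output for other pages.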

Thank you,
Alexandre Cançado Cardoso

Original issue reported on code.google.com by acc.intr...@gmail.com on 24 Sep 2013 at 8:28