rszaloki / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

2 to 3 mins taken for some URLs #6

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Modified the demo code.
2. Compile with the following command:

javac -cp boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner.java

3. Run with the following command:

java -cp .;boilerpipe-1.0.4.jar;lib/nekohtml-1.9.13.jar;lib/xerces-2.9.1.jar Oneliner

What is the expected output? What do you see instead?
I am satisfied with the output, but the time taken (2 to 3 minutes for some URLs) is not acceptable.

What version of the product are you using? On what operating system?
boilerpipe-1.0.4 under Windows XP

Please provide any additional information below.
I have attached the modified source code

Original issue reported on code.google.com by muruganp...@gmail.com on 11 May 2010 at 2:47

GoogleCodeExporter commented 9 years ago
The URL in question:
http://www.infoworld.com/d/networking/gartner-10-mobile-wireless-technologies-should-be-your-radar-075?source=rss_networking

Hi muruganprofmail,

The delay seems to be a problem with the Java HTTP client, which boilerpipe uses only for demonstration purposes, and it appears to be specific to infoworld.com (see the thread dump below).

As documented, the method DefaultExtractor.INSTANCE.getText(URL) is mainly for demonstration purposes.
Boilerpipe is not a crawler.

Try retrieving the HTML content of the infoworld.com page using your browser (or curl, wget, Apache HttpClient, etc.), save it to disk (or provide it as an InputSource), and retry the demo code (i.e., use a file:// URI). I got the results after 30 milliseconds.

Extract of a thread dump generated by kill -QUIT <pid>:

"main" prio=5 tid=0x0000000101800800 nid=0x100501000 runnable [0x0000000100500000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    - locked <0x000000010546a398> (a java.io.BufferedInputStream)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
    - locked <0x00000001054562b0> (a sun.net.www.protocol.http.HttpURLConnection)
    at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2173)
    at java.net.URLConnection.getContentEncoding(URLConnection.java:496)
    at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:91)
    at de.l3s.boilerpipe.demo.Oneliner.main(Oneliner.java:36)

Original comment by ckkohl79 on 11 May 2010 at 3:07