GoogleCodeExporter closed 8 years ago
The URL in question:
http://www.infoworld.com/d/networking/gartner-10-mobile-wireless-technologies-should-be-your-radar-075?source=rss_networking
Hi muruganprofmail,
The delay seems to be caused by the Java HTTP client, which boilerpipe uses only for demonstration purposes, and it appears to affect infoworld.com only (see the thread dump below).
As documented, the method DefaultExtractor.INSTANCE.getText(URL) is mainly for demonstration purposes. Boilerpipe is not a crawler.
Try retrieving the HTML content of the infoworld.com page using your browser (or curl, wget, Apache HttpClient etc.), save it to disk (or provide it as an InputSource) and retry the demo code (i.e., use a file:// URI). Doing it that way, I got the results after 30 milliseconds.
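As a rough sketch of the "fetch it yourself, then extract" approach, the following uses only the JDK. The class name FetchThenExtract, the helpers fetch and save, the 10-second timeout, and the UTF-8 assumption are all illustrative and not part of boilerpipe. The explicit read timeout means a server that stalls while sending headers should produce an exception instead of blocking indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FetchThenExtract {

    // Hypothetical helper: download a page with explicit connect and read
    // timeouts so that a slow server cannot block the caller forever.
    // Assumes the page is UTF-8 encoded, which may not hold for every site.
    static String fetch(URL url, int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Hypothetical helper: write the HTML to a temporary file and return its path.
    static Path save(String html) throws IOException {
        Path file = Files.createTempFile("page", ".html");
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java FetchThenExtract <url>");
            return;
        }
        String html = fetch(URI.create(args[0]).toURL(), 10_000);
        Path file = save(html);
        // Print the file:// URI of the local copy. With boilerpipe on the
        // classpath, this URI could then be given to the demo code in place
        // of the remote URL, as suggested above.
        System.out.println(file.toUri());
    }
}
```

Keeping the download separate from the extraction also makes it easy to swap in curl, wget, or Apache HttpClient for the fetch step without touching the extraction code.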
Extract of a thread dump generated by kill -QUIT <pid>:
"main" prio=5 tid=0x0000000101800800 nid=0x100501000 runnable [0x0000000100500000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x000000010546a398> (a java.io.BufferedInputStream)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
- locked <0x00000001054562b0> (a sun.net.www.protocol.http.HttpURLConnection)
at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:2173)
at java.net.URLConnection.getContentEncoding(URLConnection.java:496)
at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:91)
at de.l3s.boilerpipe.demo.Oneliner.main(Oneliner.java:36)
Original comment by ckkohl79 on 11 May 2010 at 3:07
Original issue reported on code.google.com by muruganp...@gmail.com on 11 May 2010 at 2:47