plar / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

IllegalArgumentException for many web pages #78

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
With boilerpipe-1.2.0.jar, calling
ArticleExtractor.INSTANCE.getText(new java.net.URL("http://t.co/3RplOLjc"))
throws
ERROR java.lang.IllegalArgumentException: protocol = http host = null
        at de.l3s.boilerpipe.sax.HTMLFetcher.fetch (HTMLFetcher.java:33)
        at de.l3s.boilerpipe.extractors.ExtractorBase.getText (ExtractorBase.java:87)

This happens for many other URLs, e.g. http://t.co/5vuYimwn, http://t.co/Dy5yolLs,
http://t.co/ShWhtFjP, http://nyti.ms/lQrWwp ...
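
A possible workaround, sketched under an assumption this report does not confirm: all of the failing URLs are shortened links, so the exception may come from a redirect that the JDK's HTTP client cannot follow by itself. The sketch below follows redirects by hand, resolving each Location header against the URL that returned it. The class name RedirectResolver, its methods, and the hop limit are hypothetical and not part of boilerpipe.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectResolver {
    // Hypothetical safety limit so a redirect loop cannot run forever.
    static final int MAX_HOPS = 10;

    /** Resolves a Location header value (absolute or relative) against the URL that returned it. */
    static String resolveLocation(String base, String location) {
        return URI.create(base).resolve(location.trim()).toString();
    }

    /** Follows redirects manually and returns the last URL that did not redirect. */
    static String finalUrl(String start) throws IOException {
        String current = start;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            // Turn off automatic redirect handling; this loop handles redirects itself.
            conn.setInstanceFollowRedirects(false);
            conn.setRequestMethod("HEAD");
            try {
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                if (status < 300 || status >= 400 || location == null) {
                    return current; // not a redirect, so this is the final URL
                }
                current = resolveLocation(current, location);
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("Too many redirects starting at " + start);
    }
}
```

If this assumption holds, passing new java.net.URL(RedirectResolver.finalUrl("http://t.co/3RplOLjc")) to ArticleExtractor.INSTANCE.getText might avoid the exception; that has not been tested against the URLs listed here.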

Original issue reported on code.google.com by johann.petrak@gmail.com on 22 Aug 2014 at 3:23