tsepton / nuforc-data-scrapper

Scrapper and data from the NUFORC website
The Unlicense

Error on scraping #1

Open tsepton opened 2 years ago

tsepton commented 2 years ago

This does not happen on every run; the error is intermittent.

[error] java.io.IOException: Underlying input stream returned zero bytes
[error]         at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:288)
[error]         at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
[error]         at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
[error]         at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
[error]         at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
[error]         at java.base/java.io.BufferedReader.read1(BufferedReader.java:212)
[error]         at java.base/java.io.BufferedReader.read(BufferedReader.java:290)
[error]         at org.jsoup.parser.CharacterReader.bufferUp(CharacterReader.java:87)
[error]         at org.jsoup.parser.CharacterReader.current(CharacterReader.java:235)
[error]         at org.jsoup.parser.TokeniserState$1.read(TokeniserState.java:12)
[error]         at org.jsoup.parser.Tokeniser.read(Tokeniser.java:59)
[error]         at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:82)
[error]         at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:57)
[error]         at org.jsoup.parser.Parser.parseInput(Parser.java:49)
[error]         at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:216)
[error]         at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:962)
[error]         at net.ruippeixotog.scalascraper.browser.JsoupBrowser.doc$lzyINIT1$1(JsoupBrowser.scala:88)
[error]         at net.ruippeixotog.scalascraper.browser.JsoupBrowser.doc$1(JsoupBrowser.scala:88)
[error]         at net.ruippeixotog.scalascraper.browser.JsoupBrowser.processResponse(JsoupBrowser.scala:90)
[error]         at net.ruippeixotog.scalascraper.browser.JsoupBrowser.$init$$$anonfun$4(JsoupBrowser.scala:97)
[error]         at scala.Function1.$anonfun$andThen$1(Function1.scala:85)
[error]         at net.ruippeixotog.scalascraper.browser.JsoupBrowser.get(JsoupBrowser.scala:39)
[error]         at Scrapper$.downloadReportsFromPage(Scrapper.scala:39)
[error]         at Scrapper$.$anonfun$1(Scrapper.scala:12)
[error]         at scala.collection.immutable.List.map(List.scala:250)
[error]         at Scrapper$.getReports(Scrapper.scala:12)
[error]         at Main$package$.downloadAndSaveReports(Main.scala:16)
[error]         at Main$package$.main(Main.scala:8)
[error]         at main.main(Main.scala:8)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] stack trace is suppressed; run last Compile / run for the full output
[error] (Compile / run) java.io.IOException: Underlying input stream returned zero bytes
[error] Total time: 24 s, completed Aug 17, 2022, 4:39:06 PM

Probably due to the NUFORC server closing the connection while the response is being read.

Solution: write a wrapper around browser.get(...) so that its side effects (here, the thrown IOException) are contained.
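A minimal sketch of such a wrapper, assuming that retrying is acceptable when the server drops the connection. The names SafeFetch, getWithRetry and maxAttempts are hypothetical and not part of the repository or of scala-scraper; the wrapper only relies on the call throwing java.io.IOException, as the stack trace shows.

```scala
import java.io.IOException
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object SafeFetch {
  // Hypothetical helper: evaluates `fetch` and captures its outcome in a Try
  // instead of letting the exception escape. If the call fails with an
  // IOException, it is retried until `maxAttempts` attempts have been made.
  // Any other failure, or the last IOException, is returned as a Failure.
  @tailrec
  def getWithRetry[A](maxAttempts: Int)(fetch: => A): Try[A] =
    Try(fetch) match {
      case Failure(_: IOException) if maxAttempts > 1 =>
        getWithRetry(maxAttempts - 1)(fetch)
      case result => result
    }
}
```

With this in place, a call such as browser.get(url) could be written as SafeFetch.getWithRetry(3)(browser.get(url)), and the caller decides what to do with a Failure (skip the page, log it, or abort).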

tsepton commented 2 years ago

Adding a delay between requests should also do the trick.
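A rough sketch of that idea, assuming the pages are fetched sequentially with List.map as the stack trace suggests. Throttle, mapWithDelay and delayMillis are hypothetical names, and a suitable delay value would have to be found by experiment.

```scala
object Throttle {
  // Hypothetical helper: applies `fetch` to each item in order and sleeps
  // `delayMillis` before every call except the first, so consecutive
  // requests are spaced out.
  def mapWithDelay[A, B](items: List[A], delayMillis: Long)(fetch: A => B): List[B] =
    items.zipWithIndex.map { case (item, index) =>
      if (index > 0) Thread.sleep(delayMillis)
      fetch(item)
    }
}
```

For example, Throttle.mapWithDelay(pages, 500L)(downloadReportsFromPage) would wait half a second between pages, where pages stands for whatever list getReports currently maps over.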