skrapeit / skrape.it

A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion.
https://docs.skrape.it
MIT License

[BUG] java.io.IOException: Too many open files - when scraping in a loop #149

Closed matios13 closed 3 years ago

matios13 commented 3 years ago

Describe the bug

I am trying to scrape multiple pages in a loop, as shown below, but after 231 pages I get java.io.IOException: Too many open files

Code Sample

    for (product in products) {
        val productJson = extract("$mainurl/$product.html")
    }

    fun extract(urlToExtract: String): String {
        val extracted = skrape(HttpFetcher) {
            request {
                url = urlToExtract
            }

            extractIt<ScrapedData> {
                htmlDocument {
                    val doc = findFirst(".js-model")
                    val jsonText = doc.html
                    val obj = ProductJsonRepresentation(jsonText)
                    it.json = obj.toJson()
                }
            }
        }
        return extracted.json
    }

Full stacktrace

10:42:44.472 [client.InternalHttpAsyncClient] I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: Failure opening selector
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.<init>(AbstractIOReactor.java:103)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.<init>(BaseIOReactor.java:85)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:321)
    at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
    at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Too many open files
    at java.base/sun.nio.ch.IOUtil.makePipe(Native Method)
    at java.base/sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:83)
    at java.base/sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
    at java.base/java.nio.channels.Selector.open(Selector.java:295)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.<init>(AbstractIOReactor.java:101)
    ... 5 common frames omitted

christian-draeger commented 3 years ago

Hey Mateusz, thanks for pointing this out. It was actually a bug: the connection of the internally used HTTP client wasn't being closed properly :)
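
To illustrate this class of bug, here is a minimal sketch of how a per-call client that is never closed piles up open resources in a loop, and how Kotlin's `use` releases them. `TrackedClient`, `fetchLeaky` and `fetchSafely` are hypothetical stand-ins invented for this example; they are not skrape{it}'s or Apache HttpClient's API, and this is not the actual patch.

```kotlin
import java.io.Closeable

// Hypothetical stand-in for an HTTP client that holds OS-level resources
// (sockets, selectors). It only counts how many instances are still open.
class TrackedClient : Closeable {
    init { openCount++ }

    // Pretend to fetch a page; no network access happens here.
    fun fetch(url: String): String = "<html>$url</html>"

    override fun close() { openCount-- }

    companion object { var openCount = 0 }
}

// Leaky variant: creates a new client per call and never closes it.
fun fetchLeaky(url: String): String = TrackedClient().fetch(url)

// Safe variant: `use` closes the client even if fetch throws.
fun fetchSafely(url: String): String = TrackedClient().use { it.fetch(url) }

fun main() {
    repeat(1_000) { fetchLeaky("https://example.com/$it.html") }
    println("open after leaky loop: ${TrackedClient.openCount}") // 1000

    TrackedClient.openCount = 0
    repeat(1_000) { fetchSafely("https://example.com/$it.html") }
    println("open after safe loop: ${TrackedClient.openCount}") // 0
}
```

With a real client, each leaked instance would keep file descriptors open until the process hits its limit, which would be consistent with the "Too many open files" error appearing only after a few hundred iterations.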

Thanks to Stefan for finding this tricky little one 👍 Everything should work as expected now; we tested 10_000 connections in a loop without a problem.

The bugfix is included in release version 1.1.3 - happy coding :)
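
For anyone picking up the fix via Gradle, the dependency bump might look roughly like this in a `build.gradle.kts` file. The `it.skrape:skrapeit` coordinates are an assumption based on the project's published artifacts, so check them against your existing build file.

```kotlin
dependencies {
    // Assumed coordinates; only the version number comes from this thread.
    implementation("it.skrape:skrapeit:1.1.3")
}
```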

If you like skrape{it}, don't hesitate to give it a star ⭐️