vlang / v

Simple, fast, safe, compiled language for developing maintainable software. Compiles itself in <1s with zero library dependencies. Supports automatic C => V translation. https://vlang.io

net: concurrent processing of large response bodies results in `substr(start, length) out of bounds` #18094

Open ttytm opened 1 year ago

ttytm commented 1 year ago

Describe the bug

Although this bug doesn't always occur, the probability of hitting it rises with the number of items added. With the 5 requests in the reproduction below, the likelihood of encountering it is very high, so you probably won't need to run it repeatedly.

Expected Behavior

Parsing works.

Current Behavior

V panic: substr(7, 1047477) out of bounds (len=545055)
# ...
005f48d2 : at ???: RUNTIME ERROR: invalid memory access

Reproduction Steps

module main

import net.http
import net.html
import sync.pool

fn main() {
    mut pp := pool.new_pool_processor(callback: get_news)
    pp.work_on_items(['inab', 'cyto', 'top', 'tsla', 'tela'])
}

fn get_news(mut pp pool.PoolProcessor, idx int, wid int) voidptr {
    ticker := pp.get_item[string](idx)

    resp := http.get('https://finance.yahoo.com/quote/${ticker}/press-releases') or {
        eprintln('Initial request failed for: ${ticker} ${err}')
        return pool.no_result
    }
    if resp.status_code != 200 {
        eprintln('Failed requesting press release: ${resp.status}')
        return pool.no_result
    }
    mut doc := html.parse(resp.body)
    /*
    further process doc
    ...*/

    return pool.no_result
}

Possible Solution

A Go-like reader interface for response bodies, so large bodies can be consumed incrementally instead of being buffered into a single string.

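To illustrate the suggestion, here is a minimal sketch of what such a reader-style API could look like. All names here (`BodyReader`, `read`) are hypothetical and not part of `net.http`:

```v
// Hypothetical sketch of a Go-like streaming body reader.
// None of these names exist in V's stdlib; this only illustrates the idea.
interface BodyReader {
mut:
	// read fills buf with up to buf.len bytes of the response body and
	// returns the number of bytes read, or an error once the body is exhausted.
	read(mut buf []u8) !int
}
```

With such an interface, a parser could consume the body chunk by chunk rather than relying on one large `substr` over the fully buffered response.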

Additional Information/Context

The code above has been reduced and does not include the headers and workarounds that help us avoid another common issue we encounter with concurrent requests: `SSL handshake failed` or `response does not start with HTTP/, line:`. So running the reproduction is likely to trigger such errors as well.

As a workaround, we usually add custom headers and wrap the requests in an extended try_request function that retries requests which failed due to SSL handshake errors. This problem occurs not only when scraping, but also when working with, for example, the GitHub API. Of these, the main bug of this issue, substr() out of bounds, is the most fatal, and it is not recoverable.
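A minimal sketch of such a retry wrapper, assuming the name `try_request` and the retry count are our own choices (the original helper is not shown in this issue):

```v
module main

import net.http
import time

// try_request retries http.get a few times before giving up, as a
// workaround for transient failures like `SSL handshake failed`.
// The name and parameters are hypothetical, not part of the stdlib.
fn try_request(url string, retries int) !http.Response {
	mut last_err := error('no attempts made')
	for _ in 0 .. retries {
		resp := http.get(url) or {
			last_err = err
			// brief pause before retrying the transient failure
			time.sleep(100 * time.millisecond)
			continue
		}
		return resp
	}
	return last_err
}
```

A caller would then replace `http.get(url) or { ... }` with `try_request(url, 3) or { ... }`; this only papers over the transient SSL errors and does not help with the substr() panic itself.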

V version

0.3.4

Environment details (OS name and version, etc.)

Linux (Arch).

Casper64 commented 1 year ago

How large is the HTTP response? I remember seeing a similar issue that arose when strings of about 100k+ characters were parsed.

ttytm commented 1 year ago

Thanks for taking the time to confirm this @Casper64.

And yes, responses in the example above are usually more than 100k characters.

Edit: After some exploratory testing, I'm noticing that limiting the length of the response reduces the failure rate, but the error still occurs at around 15k characters. It turns out there is already an earlier out-of-bounds error when merely assigning the response body, prior to parsing.

_ := resp.body

has the potential for the same error with large response bodies.

ttytm commented 9 months ago

Error output with V 0.4.2 (ticker names vary per run):

Initial request failed for: inab response does not start with HTTP/, line: ``
Initial request failed for: tela invalid chunksize
Initial request failed for: tsla invalid chunksize
spytheman commented 9 months ago

Try with `-d net_blocking_sockets -d use_openssl`. For me, this:

import net.http
import sync.pool

fn get_news(mut pp pool.PoolProcessor, idx int, wid int) voidptr {
    ticker := pp.get_item[string](idx)

    resp := http.get('https://finance.yahoo.com/quote/${ticker}/press-releases') or {
        eprintln('Initial request failed for: ${ticker} ${err}')
        return pool.no_result
    }
    if resp.status_code != 200 {
        eprintln('Failed requesting press release: ${resp.status}')
        return pool.no_result
    }
    eprintln(resp.body#[..20])
    return pool.no_result
}

fn main() {
    mut pp := pool.new_pool_processor(callback: get_news)
    pp.work_on_items(['inab', 'cyto', 'top', 'tsla', 'tela'])
}

produces:

#0 11:33:36 ᛋ master /v/vnew❱v -d net_blocking_sockets -d use_openssl run yahoo.v 
<!DOCTYPE html><html
<!DOCTYPE html><html
<!DOCTYPE html><html
<!DOCTYPE html><html
<!DOCTYPE html><html
#0 11:33:56 ᛋ master /v/vnew❱
ttytm commented 9 months ago

Awesome, thanks @spytheman !

shove70 commented 9 months ago

@ttytm, take a look at it when you're free, thanks. It looks like this can be closed :)

ttytm commented 9 months ago

Yes, the issue is solved. Thank you! Although the time the requests take is imho still a dealbreaker for using the stdlib for this.

shove70 commented 9 months ago

Thanks for your feedback!

Yes, at the moment these are only fixes on top of the existing code, not structural and essential improvements. Things like partial requests and so on are not yet implemented. I would like to find time to improve it further.