sparklemotion / mechanize

Mechanize is a ruby library that makes automated web interaction easy.
https://www.rubydoc.info/gems/mechanize/
MIT License

Feature Request: Peeking at the beginning of the body (without getting the whole thing) #215

Open epitron opened 12 years ago

epitron commented 12 years ago

I've found myself recently needing the ability to read just the beginning of an HTTP response (just the first 1k), and it doesn't seem like Mechanize is currently outfitted to do this.

Pluggable Parsers let me get a body_io, so I tried to read just the first 400 bytes and then close it. No dice, unfortunately! Mechanize ends up downloading the whole thing!

Is there an easy way to do this? Or would this be a nasty hack?

drbrain commented 12 years ago

Depending upon server support, the Range header can do this.

This retrieves the first 10 bytes:

require 'mechanize'

agent = Mechanize.new
# Arguments: URI, query parameters, referer, extra request headers
page = agent.get "http://localhost", [], nil, 'Range' => 'bytes=0-9'
p page.body

Apache and WEBrick support range requests; Google does not.
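Because server support varies, it can help to confirm that the range was actually honored before trusting the body. What follows is a minimal sketch, assuming the caller already has the status code and the Content-Range header value as plain values (with Mechanize these would presumably come from `page.code` and the response headers). `first_bytes_range` and `range_honored?` are hypothetical helper names, not part of Mechanize.

```ruby
# Hypothetical helpers; not part of Mechanize.

# Build the header hash that asks for the first n bytes of a resource.
def first_bytes_range(n)
  raise ArgumentError, 'n must be positive' unless n.positive?

  { 'Range' => "bytes=0-#{n - 1}" }
end

# A server that honors the range answers 206 Partial Content with a
# Content-Range header; a server that ignores it answers 200 with the
# full body and no Content-Range.
def range_honored?(status, content_range)
  status.to_s == '206' &&
    content_range.to_s.match?(%r{\Abytes \d+-\d+/(\d+|\*)\z})
end
```

A caller could pass `first_bytes_range(10)` as the headers argument and fall back to some other strategy when `range_honored?` returns false.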

For one of the servers I tested, gzip encoding caused an indecipherable response body. I'm not sure whether that is a server bug, though.

Mechanize does not copy headers across redirects at this time; I will fix that.

Fetching just the first 400 bytes is possible, but I'm unsure how to create a good API for it.

epitron commented 12 years ago

Cool! Thanks, Dr. B! The range query is a good stop-gap measure. (Good for people who want to resume downloads, as well.)

I think the API for reading the first n bytes could be really simple -- just yield the body a chunk at a time as the data streams in. If the block terminates, then the transfer could also terminate. Something like:

agent.get(whatever).stream_body(blocksize=4096) do |chunk|
  puts "YAY I GOT A CHUNK!! (#{chunk.size} bytes)"
  break
end

Would that be possible, given Mechanize's current structure?
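The control flow of that proposal can be sketched against a plain IO, independent of Mechanize. This is only an illustration under stated assumptions: `each_chunk` is a hypothetical helper, and the `StringIO` stands in for a socket-backed body. Whether Mechanize can expose the response this way is exactly the open question.

```ruby
require 'stringio'

# Hypothetical helper: yields an IO's content one chunk at a time.
# A `break` in the caller's block leaves the loop, so nothing further
# is read from the IO.
def each_chunk(io, blocksize = 4096)
  return enum_for(:each_chunk, io, blocksize) unless block_given?

  while (chunk = io.read(blocksize))
    yield chunk
  end
end

body = StringIO.new('x' * 10_000) # stands in for a streaming response body
each_chunk(body, 4096) do |chunk|
  puts "got #{chunk.bytesize} bytes"
  break
end
puts body.pos # only the first chunk has been consumed
```

The point of the design is that the consumer decides when to stop; the producer never reads ahead of what the block has asked for.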

drbrain commented 12 years ago

The greater problem is designing a good API for this. agent.get(uri).stream_body would not work, since by the time stream_body is called the body has already been downloaded. Changing Mechanize#get to return a promise is too radical.

Changing Mechanize#get to stream if a block is given seems too constricting. Yielding a response object which the user can use to stream like Net::HTTP streaming seems too complicated for Mechanize:

connection.request req do |res|
  res.read_body do |chunk|
    # …
  end
end

So I think a different API entirely would be needed.

The technical problems are minor. I foresee having to deal with content-encoding compression (streaming decompression must be added) and proper shutdown of persistent connections (using this feature may reduce performance).
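Streaming decompression of the kind mentioned above is available in Ruby's standard library through `Zlib::Inflate`. Here is a minimal sketch, assuming a gzip-encoded body exposed as an IO; `inflate_prefix` is a hypothetical helper, not an existing Mechanize method, and it handles only a single gzip member.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: decompress only as much of a gzip stream as is
# needed to obtain the first `limit` bytes of the original content.
def inflate_prefix(io, limit, blocksize = 1024)
  # MAX_WBITS + 16 tells zlib to expect a gzip header and trailer.
  inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
  out = String.new # binary buffer for the decompressed bytes

  # Stop reading compressed input as soon as enough output is available.
  while out.bytesize < limit && (chunk = io.read(blocksize))
    out << inflater.inflate(chunk)
  end

  out.byteslice(0, limit)
ensure
  inflater&.close
end

gz = Zlib.gzip('hello world ' * 1000)
puts inflate_prefix(StringIO.new(gz), 11)
```

Feeding the inflater one chunk at a time means the rest of the compressed stream never has to be read, which is what a peek at the start of the body needs.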

What is your use case for only retrieving the first kilobyte of data?

drbig commented 11 years ago

This may be related to what I'm missing: can I use it to do a GET request without ever reading even a byte of the body? And no, a HEAD request won't do. Also, I want Mechanize to handle all the redirects/authentication/cookies/whatnot. In other words, return me the response as it is at the last point before Mechanize would start downloading the body.

Why: Some sites require authentication and then serve static files via Akamai/AWS-type backends, which use more or less meaningless IDs and only give you the file name, via Content-Disposition, at the very end of the chain of redirects and auth steps. I want to get to that point, check that the file doesn't already exist, and only then read the body.

drbig commented 11 years ago

Plus, I could check the size via Content-Length, or maybe a hash if the backend server provides one.
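The decision described in this use case can be sketched as a pure function over the response headers, leaving aside how Mechanize would expose them before the body is read. This is a sketch under stated assumptions: `filename_from_disposition` and `download_needed?` are hypothetical names, headers are assumed to arrive as a hash with lowercase keys, and only the simple `filename=` form of Content-Disposition is handled (not the RFC 5987 `filename*=` form).

```ruby
# Hypothetical helpers; header values are passed in as plain strings.

# Extract the file name from a Content-Disposition value such as
#   attachment; filename="report.pdf"
def filename_from_disposition(value)
  return nil unless value

  match = value.match(/filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i)
  return nil unless match

  # basename guards against path components in a server-supplied name
  name = File.basename(match[1] || match[2])
  name.empty? ? nil : name
end

# Decide whether the body is worth reading: yes when the file is
# unknown or missing locally, or when Content-Length disagrees with
# the size of the local copy.
def download_needed?(headers, dir)
  name = filename_from_disposition(headers['content-disposition'])
  return true unless name

  path = File.join(dir, name)
  return true unless File.exist?(path)

  length = headers['content-length']
  !length.nil? && File.size(path) != length.to_i
end
```

When no Content-Length is sent, this sketch treats an existing local file as good enough; comparing a server-provided hash, where one exists, would be a stricter check.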