socketry / async-http

MIT License

Streaming `Async::HTTP::Internet` response #55

Closed: bruno- closed this issue 4 years ago

bruno- commented 4 years ago

Hi,

I'm not reporting a gem issue; rather, I'm struggling to make the code below work. I'm trying to make asynchronous requests where the response IO object is passed to a Nokogiri SAX parser.

require "async"
require "async/http/internet"
require "nokogiri"

urls = %w(
  https://www.codeotaku.com/journal/2018-11/fibers-are-the-right-solution/index
  https://www.codeotaku.com/journal/2020-04/ruby-concurrency-final-report/index
)

class HtmlDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    puts name
  end
end

Async do |task|
  internet = Async::HTTP::Internet.new
  parser = Nokogiri::HTML::SAX::Parser.new(HtmlDocument.new)

  urls.each do |url|
    task.async do
      response = internet.get(url)
      puts "#{url} #{response.status}"

      # parser.parse(response.read) # using this line works, but not streaming
      parser.parse_io(response.peer.io) # this line errors 💥
    end
  end
end

The problem line is `parser.parse_io(response.peer.io)`; it fails with this error:

 1.72s    error: Async::Task [oid=0x2bc] [pid=74458] [2020-06-03 21:42:30 +0200]
               |   Protocol::HTTP2::FrameSizeError: Protocol::HTTP2::Frame (type=112) frame length 6841632 exceeds maximum frame size 1048576!
               |   → /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/frame.rb:181 in `read'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/framer.rb:95 in `read_frame'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/protocol-http2-0.14.0/lib/protocol/http2/connection.rb:161 in `read_frame'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-http-0.52.4/lib/async/http/protocol/http2/connection.rb:106 in `block in read_in_background'
               |     /Users/bruno/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/async-1.26.1/lib/async/task.rb:258 in `block in make_fiber'

The #parse_io method needs to be passed an IO, and I've been unable to figure out how to provide one. Any hints? Thank you very much 🙏

ioquatix commented 4 years ago

You cannot use the underlying IO to stream the response. With HTTP/2, for example, multiple requests are multiplexed over the same IO. In addition, that IO carries the binary framing format, not the raw response data you expect.
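To illustrate the framing point, here is a minimal sketch that decodes nine bytes as an HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, 31-bit stream identifier, per the HTTP/2 specification). The helper name `decode_frame_header` and the sample string are hypothetical and are not part of async-http or protocol-http2. The sample was chosen because plain text starting with "he p" decodes to the same length and type reported in the error above. One plausible reading is that the parser consumed bytes from the shared IO, leaving the framer to interpret body text as a frame header.

```ruby
# Hypothetical helper: interpret the first 9 bytes of a buffer as an
# HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags,
# 1 reserved bit + 31-bit stream identifier).
def decode_frame_header(bytes)
  length_high, length_low, type, flags, stream_id = bytes.unpack("CnCCN")

  {
    length: (length_high << 16) | length_low,
    type: type,
    flags: flags,
    stream_id: stream_id & 0x7FFFFFFF # mask off the reserved bit
  }
end

# Hypothetical input: nine bytes of ordinary text rather than a real frame header.
header = decode_frame_header("he page i")

puts header[:length] # => 6841632
puts header[:type]   # => 112
```

Reading from the connection's IO directly both returns framed (non-HTML) bytes and desynchronizes the HTTP/2 framer that shares it, which is why the response body needs to be exposed through a separate IO.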

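require 'async'
require 'async/http/internet'
require 'async/http/body/pipe'

Async do
    internet = Async::HTTP::Internet.new

    response = internet.get(...)
    # Wrap the response body in a pipe so it can be read through a socket.
    pipe = Async::HTTP::Body::Pipe.new(response.body)

    # Hand the pipe's IO to whatever expects an IO-like object.
    parse(pipe.to_io)
end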

This makes an adapter socket around the HTTP data stream. Hopefully this helps you enough to figure it out - let me know if not.

bruno- commented 4 years ago

This was very helpful, thank you very much.