prawnpdf / pdf-core

Implements low level PDF features for Prawn (experimental)
Other
23 stars 38 forks source link

Stop requiring IO objects to have #size, #printf #59

Closed maxim closed 12 months ago

maxim commented 2 years ago

Hello folks,

Long story short: this PR makes it so that Renderer#render accepts IO objects that only have method <<, and no longer require methods size and printf.

Long story long: PDF::Core::Renderer supports IO objects. However, it calls output.size and output.printf in a couple of places. Turns out not all IO objects have size and printf. Take Roda Web Framework for instance, with its Streaming plugin. Here's a Roda app that can stream-render a PDF straight into the user's browser.

require 'roda'
require 'prawn'

class Web < Roda
  plugin :streaming

  route do |r|
    r.on('pdf') {
      pdf = Prawn::Document.new do
        text r.params['text']
      end

      response.headers['Content-Type'] = 'application/pdf'

      stream do |out|
        pdf.render(out) # <= This line doesn't work. This PR makes it work.
      end
    }
  end
end

Notice how we pass out to pdf.render. It fails because Renderer tries to check out.size and call out.printf. But Roda's stream object doesn't have those methods.

You get an error like this:

NoMethodError: undefined method `size' for #<Roda::RodaPlugins::Streaming::Stream>

          ref.offset = output.size
                             ^^^^^>

I think it's a good idea not to assume that output objects have those methods. File and StringIO have a size, but if you're streaming your output into some void (e.g. internet, USB wire, STDOUT, etc) you can't have a size there.

Good news is that we can actually keep track of how many bytes we've written on our side, without relying on IO object. I added a small IO wrapper that does just that, and adds consistent support for printf for good measure. Now any IO object gets wrapped, and gains support of size and printf, even if we're rendering into a bottomless pit.

The other good news is that it turned out to be a small and straightforward code change.

Let me know if you could find any reasons why this is a bad idea. I couldn't come up with any. Tested this with Prawn and Roda, and it streamed the file perfectly.

P.S. I found the code in example/lines.rb to be a good fit for writing the spec, so I borrowed it from there.

P.P.S. When rendering into a file, we no longer call file.size repeatedly. This avoid unnecessary fstat reads from disk.

maxim commented 2 years ago

Folks, any reason why the checks aren't running? 🤔

gettalong commented 2 years ago

@maxim I think that is for security reasons. Someone with the appropriate rights needs to activate them but I'm not well versed re Github actions....

pointlessone commented 12 months ago

I don't think this is a useful addition to Prawn at the moment.

This would only resolve compatibility issue with Roda. I understand that streaming might be beneficial for your app. Not having to buffer the document is probably easier on memory. It doesn't provide any advantages in the PDF context. Prawn doesn't support Linearized PDF and as a result the client would have to fully download document before it can start rendering it. Your application can use the provided wrapper as an adapter.

Thank you for taking interest in Prawn and your time.

maxim commented 12 months ago

@pointlessone Gotcha. It's your choice, and I respect it, but leave my counter-arguments here for the record.

  1. In addition to downloading, this PR enables piping PDF output of one command into another in a Linux-like environment.

  2. Currently you're forced to fully write out a file to memory or filesystem. In PDF context, this restriction is harmful, because there are other output destinations in hosting/multi-service environments, that need to receive a complete and valid PDF file. Is it pdf generator's responsibility to dictate where its output can be directed? Is there a technical reason why some destinations require a sub-optimal workaround? This PR helps answer "no" to both of these questions.

pointlessone commented 12 months ago

Yeah, I will actually reduce requirements for the output object to only << but it's going to use internal buffer. And the reason is actually trivial: current public API returns a string of the fully serialised document and I don't want to break it.

  1. Prawn lets you get a string of your document. In your application you can do whatever you want with it. It's not like Prawn can only output to a file descriptor. You can send that string over the network, you can write it to STDOUT, you can discard it, or do whatever you can do with any other string.
  2. I'm not quite sure what you're getting at here. Prawn doesn't generate PDF that can be processed before the last byte is written so there's not much optimisation in sending out every byte. Somewhere these bytes would have to be kept together anyway. Prawn doesn't strive to be the most efficient implementation either. You've probably chosen a wrong stack if your app can't hold a few MB of serialised document for a few seconds in memory. If you really need to flush every byte over the network then you can supply an output object similar to the wrapper in this PR to Prawn. As you demonstrated it's not that hard to implement. At the moment I'm not convinced this particular optimisation would substantially improve performance for most Prawn users.

    However, this is only my gut feeling. If you show me benchmarks proving otherwise we can get back to seeing how to implement it.

maxim commented 12 months ago

Yeah, I will actually reduce requirements for the output object to only << but it's going to use internal buffer. And the reason is actually trivial: current public API returns a string of the fully serialised document and I don't want to break it.

That makes sense. Prawn could provide a different output method that doesn't buffer, to avoid breaking API, but I wouldn't want you to have to maintain something you aren't entirely behind.

Somewhere these bytes would have to be kept together anyway.

Exactly, the important question is where. If you have a service that produces pdfs, it is unnecessary for this service to have the ram or disk to hold the result. However, I agree that it's not a big deal to make it necessary, or to inject a custom output object. Just thought this behavior is worth being available out of the box.