Closed arturadib closed 12 years ago
@jviereck Can't we use |position: absolute| to overlap image + text? @cgjones Will the performance really be different between an SVG port and an HTML port? I know the SVG port is a bit slower than the canvas one, but I wonder how much we can reduce the difference.
Also it will not solve our selection problem that easily, since we still need to 'guess' the page formatting before building a DOM if we want selection to work correctly, but that's probably the same type of issue we will encounter with SVG.
Also, if we take the SVG road we will probably miss the Q3 goal, since it requires fixing the Gecko platform, and even if someone achieves this it won't be ready before Firefox 10/11.
In particular Julian seemed to think that certain things (overlapping image+text?)
What most of the HTML5-based renderers do: they move all the painting to the background and put all the text on top. That means that whenever some graphics cover some text, you still see the text. This doesn't seem to happen that often. Maybe we can add an algorithm to determine whether text overlapping is needed, and otherwise just put the graphics in the background and the text on top.
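To make the "determine whether text overlapping is needed" idea a bit more concrete, here is a rough sketch. It assumes a made-up command format (`kind` plus a bounding `box`), which is not the actual pdf.js drawing queue, and the function names are hypothetical. It only compares bounding boxes, so it can report an overlap where the real shape would not cover any glyph.

```javascript
// Hypothetical command format: { kind: "text" | "paint", box: { x, y, w, h } }.
// This is NOT the pdf.js IR; it is just enough structure to illustrate the check.

// True when two axis-aligned rectangles share some interior area.
function boxesIntersect(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

// Returns true if some graphic is painted after, and on top of, earlier text.
// In that case "all graphics in the background, all text on top" would
// render the page incorrectly.
function needsTextUnderGraphics(commands) {
  const textBoxes = [];
  for (const cmd of commands) {
    if (cmd.kind === "text") {
      textBoxes.push(cmd.box);
    } else if (textBoxes.some((box) => boxesIntersect(box, cmd.box))) {
      return true;
    }
  }
  return false;
}
```

Pages where this returns false could use the simple two-layer layout; the others would need a fallback.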
@vingtetun we can/have to use position absolute. However, it's not that simple to determine the size of an image/drawing shape. The easiest way is to add a new canvas that fills up the entire page for each rendering that is needed, but that's going to be slow.
In terms of printing, the HTML5 version should be pretty good for text, but the drawings are going to be rasterized. We might be able to render the images/drawings using SVG and the text using divs.
My plan is to get the WebWorker patch landed. Then we have a nice IR queue of drawing commands that we can examine way more easily than we do right now. I also want to speak to Andreas about some ideas I had.
PS: @arturadib: Really awesome to get all this kicked off!
I looked into HTML5 before. There are a bunch of problems when drawing overlaps text (you have to layer canvas on html on canvas), but for most documents it should work well.
But is text selection dependent on the order of DOM elements? Lines in the PDF could be out of order semantically but in order visually, and text selection would be messed up. For example, columns are sometimes put in column order and sometimes in row order... We might have to do some reordering based on visual appearance before adding to the DOM if that is the case. This would make it a little harder, since you would need to delay all text rendering until the end, but it might work.
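A minimal sketch of what "reordering based on visual appearance" could look like, assuming each text run carries a position (`x`, `y`, with `y` growing downward). The run format, the function name and the `lineTolerance` parameter are all made up for illustration. It only groups runs into lines and sorts left to right, so multi-column pages would still need some column detection on top of this.

```javascript
// Hypothetical run format: { str, x, y }, with y growing downward.
// Runs whose y differs by at most lineTolerance from the first run of a
// line are treated as belonging to that line.
function sortIntoVisualOrder(runs, lineTolerance = 2) {
  // Copy before sorting so the caller's array keeps its original order.
  const sorted = [...runs].sort((a, b) => a.y - b.y);
  const lines = [];
  for (const run of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(run.y - last.y) <= lineTolerance) {
      last.runs.push(run);
    } else {
      lines.push({ y: run.y, runs: [run] });
    }
  }
  // Within each line, order runs left to right, then flatten.
  return lines.flatMap((line) => line.runs.sort((a, b) => a.x - b.x));
}
```

Since all runs have to be collected before sorting, this matches the point above about delaying text rendering until the end of the page.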
There shouldn't be a huge perf difference between SVG and HTML for text; both should be similarly slower than canvas, because of DOM and painting overhead. I would expect SVG to be a bit faster than HTML for scrolling because moving around even pos:absolute elements forces reflow, for a bunch of arcane reasons. There's no reflow in native SVG.
My biggest concern is what folks already noted above, the overlapping text/figure problem. It's not really ok if it works "most of the time": if we screw up rendering a few % of the time no one is going to use pdf.js (portable rendering is the point of PDF in the first place).
The goal of this work is to move towards text selection and a11y, right? I think we should start by understanding the problem better instead of jumping straight to mechanisms to implement solutions. An evince developer was kind enough to give us a big head start there: http://blog.mozilla.com/cjones/2011/06/15/overview-of-pdf-js-guts/#comment-4100 . Let's see how common those tags are and what it would take to add selection for them. If something dumb like a transparent HTML overlay of the canvas is possible, awesome!
Generally, I would choose pixel-perfect rendering over perfect text selection, if there's a tradeoff to be made. Text selection in untagged PDFs is an undecidable problem ("what was the author thinking?") so we're never going to get it 100% right without help. No other PDF reader can either. What we can get 100% right is rendering, which other PDF readers do (modulo arcane features).
Generally, I would choose pixel-perfect rendering over perfect text selection, if there's a tradeoff to be made.
That's the important question. I talked to someone who does the technical stuff for the Swiss jobs website. They have a lot of PDFs they want to show off. The reason I see sites like slideshare go with HTML5 is that it runs on iOS devices and the like very simply. Doing the rendering with SVG/Canvas there is kind of overkill and won't work in a nice way on such devices - at least not on all the devices an enterprise person wants it to run on. That's why I think we should implement an HTML5 version that just renders all the graphics to the background and puts the text on top, noting that there are some limitations if you use this solution.
The cool thing about this is also that it's easy to get working on handset devices. You can run some code in node that creates the background images for different devices and caches them; if an iPad comes along, send it the right background image + the font + text and render it there.
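Just to illustrate the caching part, a very small sketch. `renderBackground` is a hypothetical callback standing in for whatever would actually rasterize a page on the server, and keying the cache on page number plus device width is my assumption, not something decided in this thread.

```javascript
// renderBackground(pageNumber, deviceWidth) is a hypothetical callback that
// stands in for the real server-side rasterizer; it is not a pdf.js API.
function createBackgroundCache(renderBackground) {
  const cache = new Map();
  return function getBackground(pageNumber, deviceWidth) {
    const key = `${pageNumber}@${deviceWidth}`;
    if (!cache.has(key)) {
      // Render once per (page, width) pair and remember the result.
      cache.set(key, renderBackground(pageNumber, deviceWidth));
    }
    return cache.get(key);
  };
}
```

A real server would also need eviction and some mapping from user agent to a small set of widths, which this leaves out.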
What I want to point out is: There might be many solutions for different use cases.
For the use case of running in a browser (desktop only in the following, i.e. one that has enough resources etc.) it depends as well. If you just want to get something rendered on e.g. IE6, you need to use the HTML5 version as well, as everything else is too slow. If someone only aims for the latest browsers, you can choose something else (e.g. a real PDF viewer for a Firefox plugin).
The canvas version is something we're going to need for the HTML5 version anyway. That's why I think we will have to support it regardless. That said, on browsers that support it, we can use SVG for rendering. Really want to hack on the spick, but the worker stuff still blocks me :(
SVG looks most promising and I think we can get it up to speed by doing some analysis! It also solves the issue of printing and text selection (assuming things get fixed in Gecko soon).
Nevertheless, I don't think SVG is going to be the best solution for the printing problem. If we just have a few pages, we can render it all in SVG and print, BUT if we have a very big document with 1'000 pages, this doesn't work, as we can't have the SVG markup for 1'000 pages in the DOM before we do the actual rendering - that's just going to break every computer in terms of memory usage. We definitely want to do something better and I see no way around some new Web API. But that's maybe something to tackle once we have something in place that we can work from.
That's what I came up with while thinking yesterday evening. Sorry if this turns into a talk about the universe and everything^^
Obviously I think canvas+WebPrint API+manual text selection impl is the best way to go here. Perhaps it is more work, but I think it will pay off and I'm willing to help! :)
As for mobile devices, I'm not even sure if it's worth it to worry about them yet as their JavaScript implementations are probably not fast enough to do all the decoding stuff that pdf.js does yet anyway, and probably are missing some features that would result in less than perfect rendering as well. So probably not worth it yet.
Obviously I think canvas+WebPrint API+manual text selection impl is the best way to go here. Perhaps it is more work, but I think it will pay off and I'm willing to help! :)
That sounds cool, but the question is the time frame: how long is it going to take to get such a new API implemented? The SVG solution looks better than nothing and we can get it working right now.
Implementing text selection manually might be doable (although it could get painful..), but I'm concerned about accessibility issues and how you are actually going to render the blue background that indicates selection. If you have to rerender the entire canvas just to add this little bit of blue, it's going to be way too slow for most devices, but I couldn't come up with something better. Any idea how other renderers do this or how this is done in the DOM (Chris?)?
As for mobile devices, I'm not even sure if it's worth it to worry about them yet as their JavaScript implementations are probably not fast enough to do all the decoding stuff that pdf.js does yet anyway, and probably are missing some features that would result in less than perfect rendering as well. So probably not worth it yet.
Well, there are a lot of people interested in having this. What you can do on mobile devices isn't a complete rendering of the PDF page, but if you have some kind of prerendered form served to the device, you can get something working.
From the technical point of view this might not be too interesting, but we should keep in mind what this project is useful for to other people. If we can push the web by making it possible to view any kind of PDF on any web device, this is going to be huge. While hacking on Bespin/Skywriter, we concentrated only on cutting-edge technology and never thought about supporting older browsers. That approach is cool, but it might not be the best thing for mankind, if you see what I mean ;)
Could you make a second canvas that only renders the text selection and layer it with the canvas that actually draws the document? Actually you'd end up with 3 canvases layered on top of each other in this order:
- The main canvas that draws everything in the document except the text.
- The text selection canvas - all it draws is the text selection.
- The text canvas itself - it draws all the text in the document.
They get layered on top of each other so that it looks the same, but is actually rendered across 3 canvases instead of 1.
And yes, you would need an API for accessibility and printing of canvases for this approach.
So for mobile, you're thinking of preprocessing the document on the server? I'm just worried that mobile devices are not going to be fast enough to run all this JS. :)
Mobile works just fine. It's about 5x slower, but still well within what's possible. We are faster than my native code reader on mobile.
Didn't load on my iPad. Blank white pages rendered. Perhaps fixable...
Also, going the HTML5/SVG route prevents the ability to render pages to images/thumbnails easily and experiments like http://scotland.proximity.on.ca/dxr/tmp/CubicVR.js/samples/pdf/. Just thinking out loud. :)
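For the three-canvas idea, a rough sketch of how the stacking and the "only repaint the selection" part might be expressed. The layer names and both function names are made up here, and the styles are plain strings so this does not depend on any DOM or pdf.js API; it is only meant to show the ordering and why a selection change would not need a full re-render.

```javascript
// Layer names are hypothetical. The order follows the three-canvas list,
// bottom layer first: graphics, then the selection highlight, then the text.
const LAYERS = ["graphics", "selection", "text"];

// One absolutely positioned, page-sized style per layer, stacked by z-index.
function layerStyles(width, height) {
  return LAYERS.map((name, index) => ({
    name,
    style: `position:absolute;left:0;top:0;width:${width}px;height:${height}px;z-index:${index}`,
  }));
}

// Which layers must be repainted for a given kind of change.
// A selection change only touches the selection layer; anything else
// (zoom, page change, ...) is assumed to repaint every layer.
function layersToRepaint(change) {
  if (change === "selection") return ["selection"];
  return [...LAYERS];
}
```

Because the highlight sits between the graphics and the text, the blue box ends up behind the glyphs, and dragging a selection only clears and redraws one mostly empty canvas. It does not address drawings that sit on top of text.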
How do you handle the case if you have some drawing on top of text? That's the thing that makes this simple thing break :(
Another issue to consider is Type3 fonts since they can't really be converted to an open font. Text selection for them will be hard no matter what we do though I guess.
Text selection has landed in #738. Unassigning self, closing until someone wants to revive the whole new HTML5 backend issue.
If you look at the latest version of crocodoc, it seems that they have foregone their SVG based solution and have moved towards an HTML5 solution - however I can't seem to find any canvas elements regardless of my UA! Their docviewer.js is not doing any processing obviously, so their documents are being pre-processed server side which is the main reason why it works across all mobile devices. https://crocodoc.com/see-it-in-action
We have been toying with using pdf.js server side with node-canvas to convert documents before rendering - which reduces a lot of mobile device compatibility and performance issues - if folks are interested in re-opening this and/or opening another project, let's get the conversation going! Who is up for it?
Hi @acao,
We have been toying with using pdf.js server side with node-canvas to convert documents before rendering - which reduces a lot of mobile device compatibility and performance issues - if folks are interested in re-opening this and/or opening another project, let's get the conversation going! Who is up for it?
I know there are a bunch of people interested in rendering PDFs on the server using PDF.JS. I'd like to have a "node" backend for rendering PDFs in the PDF.JS repo, but I'm not sure what the other team members think about this.
I've opened issue #1664 "Implement "node" backend" for what you mentioned. Let's get some conversation going there.
I'm highly interested in this! :-)
Hello guys, I've started a project rendering PDF in HTML. https://github.com/mozilla/pdf.js/issues/565
And a demo is here: http://coolwanglu.github.com/pdf2htmlEX/demo/demo.html
The rendering speed is ok to me, as I've compressed the HTML (especially styling) as much as possible.
Now an individual <div> is used for each line with the same styling (font, size, transform), while the spacing between words is padded with <span>'s. And the uniform letter-spacing & word-spacing from the PDF can be adjusted directly using CSS.
It's kind of almost accurate now, but one problem is that the browsers (FF & Chrome) may not use the exact font size I've chosen (for example, it would use a 9px size if I specified 9.213px).
I hope this would be useful for PDF.js, and also some of you may be interested in my project :)
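On the fractional font size problem (9.213px ending up as 9px), one possible workaround, and this is only an assumption and not something pdf2htmlEX is confirmed to do, would be to give the browser an integer font size and make up the difference with a CSS transform. The function names below are hypothetical, and older browsers may need a vendor-prefixed `transform` property.

```javascript
// Split a fractional font size into an integer CSS font size plus the
// residual scale factor needed to get back to the requested size.
function splitFontSize(sizePx) {
  const base = Math.max(1, Math.round(sizePx));
  return { fontSizePx: base, scale: sizePx / base };
}

// Build an inline style string for one line of text.
function lineStyle(sizePx) {
  const { fontSizePx, scale } = splitFontSize(sizePx);
  return `font-size:${fontSizePx}px;transform:scale(${scale.toFixed(4)});transform-origin:0 0`;
}
```

Scaling the whole line also scales letter-spacing and word-spacing, so those values would have to be divided by the same factor to stay accurate.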
@coolwanglu Neat stuff! We'll definitely take a peek at your project if/when we experiment with an HTML5 backend. Thanks!
This is motivated in part by text selection, so it's been assigned to the text selection milestone.
We have the following options for rendering:
(By HTML5 I mean: use all HTML elements, not just <canvas> :))
I'd like to analyze the pros and cons of a fully HTML5 backend. The main driving force for this is that it seems like the documents industry is investing heavily in this platform:
Check out in particular Crocodoc's rendering of tracemonkey.pdf :)
It feels pretty smooth and fast to me, and text selection is a piece of cake.
I'd like to gather the opinion of the community on this possibility. It'd solve our text selection problem, but it might open another can of worms. In particular Julian seemed to think that certain things (overlapping image+text?) were not possible. Chris pointed out performance bottlenecks. Another concern I can think of is printing.
Can we start a dialog about this?