Closed: jviereck closed this issue 11 years ago.
Julian, thanks for taking the initiative on this!
I think this is a great tool - when our extension is released into the wild, I believe this will offer a sample of (broken) PDFs that is more representative of our end-users' PDFs than those reported on Github.
I do have a suggestion on where to store and what to do with such PDFs that I think will ease the pain for both users and developers: Store them into @brendandahl's corpus report framework.
If we use that framework we don't have to ask our users what's wrong with the PDF (as his framework already detects visual differences and JS errors/warnings/todos), nor do we have to remember to close a number of issues whenever we fix a new feature (as his framework automatically "closes" a broken PDF by detecting a "pass").
As a side bonus, we get better statistics on the top errors/warnings/todos/etc that matter to our end-users.
At first I thought we could simply store them in our existing corpus. However, as wisely pointed out by @wfwalker, that would mean we'd be chasing an ever-growing completion metric. We should therefore probably have two separate "buckets": one for quarterly goals based on PageRank, and another for user-submitted PDFs. (There should be considerable overlap of top errors between the two, but it'd still be nice to keep things separate.)
@brendandahl @notmasteryet @vingtetun What are your thoughts on this?
Having it integrated with the corpus report framework sounds like a good idea. We could have two groups of broken PDFs: the ones from Google that we compare against right now, and the reported ones.
We could expand the current corpus report framework with a feature like tagging, so that we can record the type of breakage (wrong character, wrong character width) as well. Since the framework already collects which TODOs occur in the PDF, we might be able to do some kind of "auto-tagging".
The only downside is that we have to generate images of the PDFs. That means we should do some human checking before publishing this information publicly.
The corpus report has to be run on a Mac, as we compare against Preview. Therefore we can't do this feedback integration fully automatically, which isn't possible anyway because of the human checking. What we can do is store the reported links somewhere and write a script that fetches the not-yet-processed links, downloads the PDFs and, if they are valid, generates the images and publishes the current version.
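Just to make the idea concrete, a minimal sketch of such a fetch script (the file and directory names here are made up for illustration; generating the images on the Mac would still be a separate step):

```js
// Hypothetical fetch-and-validate script; "reported.txt" and "submitted/"
// are made-up names, not part of the existing corpus report framework.
const http = require('http');
const fs = require('fs');
const path = require('path');

const links = fs.readFileSync('reported.txt', 'utf8').split('\n').filter(Boolean);

links.forEach((url, i) => {
  http.get(url, (res) => {
    const chunks = [];
    res.on('data', (chunk) => chunks.push(chunk));
    res.on('end', () => {
      const data = Buffer.concat(chunks);
      // Very rough validity check: a real PDF starts with "%PDF-".
      if (data.slice(0, 5).toString() !== '%PDF-') {
        console.log('skipping (not a PDF):', url);
        return;
      }
      const dest = path.join('submitted', 'reported-' + i + '.pdf');
      fs.writeFileSync(dest, data);
      console.log('stored', dest, 'from', url);
    });
  });
});
```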
The corpus report sounds great to me. Is it automated yet?
The only downside is that we have to generate images of the PDFs. That means we should do some human checking before publishing this information publicly.
Do we have to make the report public? Can't we just use it internally?
The corpus report has to be run on a Mac [...] Therefore we can't do this feedback integration fully automatically
Macs are mighty, man! :) I betcha we can make it work. We'd need a dedicated Mac instance with a fixed IP and a Node.js script (for example) to listen for PDF uploads and integrate with Brendan's code. It could be "smart" about it, e.g. checking MD5 hashes against existing PDFs to avoid duplicates (see the sketch below).
If we scale to a point where it's bogging down our Mac server we can start doing random sampling.
I can definitely help with this :)
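To sketch what I mean (just an illustration; the port, URL path, and storage directory are assumptions, nothing that exists yet):

```js
// Hypothetical upload listener with MD5-based deduplication, as described
// above; "/upload" and "pdf-uploads/" are illustrative names only.
const http = require('http');
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

const STORAGE_DIR = 'pdf-uploads';

http.createServer((req, res) => {
  if (req.method !== 'POST' || req.url !== '/upload') {
    res.writeHead(404);
    res.end();
    return;
  }
  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    const pdf = Buffer.concat(chunks);
    // Deduplicate by the MD5 hash of the file contents.
    const md5 = crypto.createHash('md5').update(pdf).digest('hex');
    const dest = path.join(STORAGE_DIR, md5 + '.pdf');
    if (fs.existsSync(dest)) {
      res.writeHead(200);
      res.end('duplicate\n');
      return;
    }
    fs.writeFileSync(dest, pdf);
    // From here the corpus report scripts could pick up the new file.
    res.writeHead(201);
    res.end('stored\n');
  });
}).listen(8080);
```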
Do we have to make the report public? Can't we just use it internally?
No, but Mozilla tries hard to be as open as possible. Therefore, we shouldn't hold back any data that might be useful for other projects as well.
Macs are mighty, man! :)
Most of Mozilla's infrastructure runs on Mac minis. Maybe we can get one of these for testing.
So another option is to try to get our PDFs matching better with Poppler on Linux. This could be as easy as just running Firefox and Poppler on Linux. I have it on my todo list to try, but if someone wants to take a shot: just create a few snapshots with Poppler and pdf.js on Linux and compare the two using a perceptual diff (see the sketch below).
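Roughly along these lines, assuming poppler-utils (`pdftoppm`) and the `perceptualdiff` tool are installed and that a pdf.js snapshot for the page already exists (e.g. from the test harness); all file names here are made up:

```js
// Hypothetical comparison of a Poppler rendering against an existing pdf.js
// snapshot using perceptualdiff; file names are illustrative only.
const { execFileSync } = require('child_process');

const pdf = 'broken.pdf';                 // the PDF to check
const pdfjsSnapshot = 'pdfjs-page1.png';  // snapshot produced by the pdf.js test harness

// Render page 1 with Poppler at 150 DPI; produces poppler-1.png
// (the page number in the output name may be zero-padded for longer documents).
execFileSync('pdftoppm', ['-png', '-r', '150', '-f', '1', '-l', '1', pdf, 'poppler']);

// perceptualdiff exits with a non-zero status when the images differ perceptibly.
try {
  execFileSync('perceptualdiff', ['poppler-1.png', pdfjsSnapshot]);
  console.log('PASS: renderings match');
} catch (e) {
  console.log('FAIL: perceptual difference detected');
}
```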
we shouldn't hold back any data
Unfortunately I think we'll have to, for legal reasons - the PDFs that users submit will likely be protected by copyright. This is an issue similar to #822:
https://github.com/mozilla/pdf.js/pull/822#issuecomment-2839077
another option is to try to get our PDFs matching better with Poppler on Linux
That sounds great; it would mean we'd have only one (EC2) machine to maintain. I can help set it up once it's ready for staging (I'm overdue to install VMware and Ubuntu on my machine...).
I don't think we are going to proceed with this. Closing?
This issue discusses which data is stored for the feedback feature and evaluates how and where this is done.
This is also about determining whether storing the data fits Mozilla's guidelines for handling and storing user data.
The feedback form looks like: http://postimage.org/image/yeych56wr/.
You find the different versions about the drafts here:
Version 01: https://gist.github.com/1376532/bb52a352117cfe81a79d3f3e35a82362f31a7900
\cc @arturadib, @wfwalker