haystack / nb

The sidebar doesn't load when a pdf has one page only #203

Open JumanaFM opened 2 years ago

JumanaFM commented 2 years ago

Currently in nbclient, if a pdf has only one page, the sidebar does not load.

semisenioritis commented 1 year ago

does this only happen for single page pdfs or even multi page pdfs?

karger commented 1 year ago

please provide publicly accessible test cases if possible

semisenioritis commented 1 year ago

https://home.ttic.edu/~avrim/book.pdf

This is the textbook that I am using. After some experimentation I realized that the issue is that the library you are using to make the pdf annotatable requires intense preprocessing (around 3-4 minutes for initial setup), and until the entire pdf is preprocessed, neither the annotation sidebar nor the annotations show up. This makes sense, since the pdf is the core and the annotations build on it, but it becomes a really big issue when that setup time is required every time the open window or tab is changed. This makes the software unusable, as the loading time is just not bearable.

I tried out the same document with nb1 and found that, since each page was rendered as a single image and the annotations were block-like in nature (thus losing fine control), the rendering of each page was initiated as and when required, making the process faster. My suggestion would be to provide a highlighting ability that doesn't map directly to the threads, but instead maps to a background block-style annotation that is invisible to the user and non-interactive, which might make the system faster.

I'm probably missing out a lot of details since I don't know the code thoroughly, but I'd be happy to help out!

semisenioritis commented 1 year ago

Also, would it be possible to make the code of nb1 publicly available? I wasn't able to find it in the haystack repositories.

karger commented 1 year ago

NB1 is at https://github.com/nbproject/nbproject

karger commented 1 year ago

The right solution to the problem that you've identified is for NB to "process" the pdf (which nowadays means converting it to html for in-browser rendering) on the server once, and store it there, and deliver that HTML directly to the client at time of use, instead of the current approach of shipping the pdf to each client for processing at the time of use. There should be an issue for this but I can't find it; if it really isn't there we should add it @JumanaFM .

semisenioritis commented 1 year ago

Exactly, preprocessing is something that I think is happening on the client side, and if possible it should happen all at once during the pdf upload process. It will probably save a lot of resources.

On another note, I checked this same issue with Mozilla's built-in pdf viewer and hypothes.is's pdf annotator as well (both being open source), but neither of them seems to have this issue. Any idea how they manage it, and whether the same source code can be used? Mozilla's viewer doesn't have the ability to annotate and highlight, but other annotation plugins built on Mozilla's pdf viewer work pretty smoothly too.

Also, thanks for the nb1 link!

karger commented 1 year ago

So far as I know every platform and browser has converged on use of the pdf.js library to render pdf to html.  I presume hypothesis is using that library to do what I described.   It's a simple matter of programming to add this functionality to NB; we just haven't had the resources for it.

semisenioritis commented 1 year ago

Ahh, got it. But I briefly looked at the nb2 source code and you also use pdf.js. Sorry in advance if it's a basic question.

karger commented 1 year ago

Right; we use the same canonical library as everyone else. But we're running it every time on the client, when instead it really ought to be run once on the server.

karger commented 1 year ago

If you are looking to contribute this would be a very nice issue to work on.

semisenioritis commented 1 year ago

I'm planning to modify nb a bit for my own requirements, and I really need to be able to work with big files for this. I'd love to contribute! Are there any resources for this specific issue you could point me towards?

karger commented 1 year ago

I'd love for you to contribute back anything you think could be helpful to others. In particular this prerendering of large pdfs would be of great general benefit. NB1 did this (it rendered into images instead of html, but same idea).

I take it you've already found the client and server code. We're active on the repo discussion and happy to help out if you need help understanding or finding specific things.

semisenioritis commented 1 year ago

Yup, I have already set up nb2 on my laptop, but my system kept crashing because of the local hosting. I think that for some reason nb2 does both pre-rendering and client-side rendering, as my local nb took twice as long as the hosted nb. Just a guess though.

I'll start with figuring out how nb1 rendered images so that I can use that here.

karger commented 1 year ago

I recommend against the nb1 approach.

In NB1 we rendered PDFs to images, which loses all information about the flow of text. That's why you can only highlight rectangles in NB1---the lines don't exist at that point. For most applications, preserving the text flow by rendering to HTML is far superior.

One special case is image annotation---there you do want to be able to highlight and annotate specific rectangular parts of the image, since there is no flowing text.  You can do that in NB1 since any embedded images also get flattened onto the pdf image. In contrast, right now in NB2 you can only annotate all or none of the image.   Fixing that is also on the todo list.

You may be wondering why NB2 is less powerful than NB1 on these issues; the answer is that NB2 is a dramatic improvement over NB1 in a huge number of other directions, while what I've discussed are the (only) two key sacrifices we made to get there.

In particular, the NB1 code is a complete nightmare.  You won't find anything of use there.

semisenioritis commented 1 year ago

Totally agreed, especially the points about pdf to images. I initially wanted to use nb1 when that was the only option available, but then I realized that without fine control over the text, the context of the related question would be lost on readers. This wasn't a very big issue and could easily be worked around, but it made me postpone my project for later.

Nb2 did initially feel like more of a frontend modification at the cost of speed, but as I went deeper, I realized that a lot of features were added, making it more user-friendly.

But if I shouldn't even refer to nb1 code, where is a good place to start?

karger commented 1 year ago

Are you asking specifically about how to tackle server side rendering in nb2?

semisenioritis commented 1 year ago

Yes. Maybe some resource or something I can look into or something that already implements this well.

JumanaFM commented 1 year ago

This is how it's done on NB currently https://github.com/haystack/nb/blob/7f0e24a07db0b5de1f54c5d4f20114a14d994f73/public/nb_viewer.html Take a look and contribute if you can, we appreciate it!

karger commented 1 year ago

At present, nb_viewer fetches the target pdf from the nb server, then uses the pdf.js library to convert it to html that nb can annotate. We should instead be using the same pdf.js library on the server to convert the pdf to html there once, then save the resulting html in a suitable cache directory so that the html can be served on request.
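
A minimal sketch of that convert-once-and-cache idea, using only Node's standard library. Everything named here is an assumption for illustration and not NB's actual code: `getRenderedHtml`, the cache layout, and the injected `convert` function, which stands in for whatever pdf.js-based routine produces the html. It is kept synchronous for brevity; a real pdf.js-based converter would be promise-based.

```javascript
// Sketch only: names, paths, and the converter are hypothetical.
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

/**
 * Return cached HTML for a PDF, converting it only on the first request.
 * `convert` is a placeholder for the routine that turns PDF bytes into an
 * HTML string; it is injected so the caching logic does not depend on any
 * particular library.
 */
function getRenderedHtml(pdfPath, cacheDir, convert) {
  const pdfBytes = fs.readFileSync(pdfPath);
  // Key the cache on the file contents so a changed PDF is converted
  // again instead of reusing stale HTML.
  const key = crypto.createHash("sha256").update(pdfBytes).digest("hex");
  const cachedPath = path.join(cacheDir, `${key}.html`);

  if (fs.existsSync(cachedPath)) {
    return fs.readFileSync(cachedPath, "utf8"); // cache hit: no conversion
  }

  const html = convert(pdfBytes); // cache miss: convert exactly once
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cachedPath, html, "utf8");
  return html;
}
```

With this shape, the first viewer of a document pays the conversion cost and every later request is a plain file read. The same function could also be called at upload time so that no viewer ever waits.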

semisenioritis commented 1 year ago

really helpful, thanks!

semisenioritis commented 1 year ago

why not just save the generated html file on the server, deleting the original pdf?

semisenioritis commented 1 year ago

What I'm thinking is that once the professor uploads the file, the server converts it to an html file and saves that file for all later use. If the student/professor wants to download the file as a pdf, we perform the same thing in reverse on the server and provide the document.

semisenioritis commented 1 year ago

It seems that converting pdfs to html documents doesn't always work out: most of the files have their own specific fonts, without which the file gets corrupted. Also, I looked a bit deeper into the hypothesis code, and it seems that they aren't using the pdf-to-html system either. Not really sure how to proceed at this point.

karger commented 1 year ago

We definitely don't want to delete the pdf, because there will be some pdfs whose renderings will get better as newer versions of the pdfjs library are released. But we should indeed be saving the generated html file on the server.

Note that running pdfjs on the server to do the conversion should be easy; it's a js library and our nodejs server is js based.   I don't know if pdfjs offers any warnings when it has trouble converting; if so we should get those delivered back to the person who uploaded the pdf.

karger commented 1 year ago

pdfs that cannot be converted are just as big a problem with the current system as they would be with server-side conversion---it's the same library either way. So we're no worse off doing the conversion server side.

But such problematic pdfs are rare and getting rarer, because pdfjs is also the library that gets used by firefox to render pdfs in the browser, so it gets lots of attention.

Google chrome uses a different conversion library, pdfium, for the same purpose. We could use that library instead of pdfjs if we decided it was more robust. Pdfium would have to run in a separate process since it isn't js based, but we could easily have our server invoke it at need, using for example this python wrapper.
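
A rough sketch of what "invoke it at need" could look like from a Node server, under stated assumptions: `runExternalConverter` is a hypothetical helper, and the executable name and arguments are placeholders, since the real ones depend on which pdfium wrapper is chosen. Only Node's `child_process` module is used.

```javascript
// Sketch only: the command and its arguments are placeholders.
const { execFileSync } = require("node:child_process");

/**
 * Run an external converter and return whatever it writes to stdout.
 * Throws if the process exits with a non-zero status or times out.
 */
function runExternalConverter(command, args, timeoutMs = 60000) {
  // execFileSync takes the arguments as an array and does not go through
  // a shell, so a file name with spaces or quotes cannot inject commands.
  return execFileSync(command, args, {
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: 64 * 1024 * 1024, // large documents can produce a lot of output
  });
}
```

The synchronous call keeps the sketch short but would block the server while a large pdf converts; the asynchronous `execFile` from the same module is probably the better fit in a real request handler.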

semisenioritis commented 1 year ago

Riiight, that makes sense. I'll try this.

semisenioritis commented 1 year ago

@JumanaFM sorry for bothering you again and again, but is there any documentation for pdf.js at all? No matter where I search, I can't seem to find any. The official docs point to links that are incomplete, and the only documentation that exists is user-contributed and doesn't make a lot of sense (https://github.com/MeiKatz/pdfjs-docs/blob/master/README.md). What did you refer to for documentation?

I don't mind switching to pdfium, but if I can I'd prefer staying close to the source code.

JumanaFM commented 1 year ago

Not a bother, happy to help! The best resource is the official page https://mozilla.github.io/pdf.js/

Another resource that might be helpful is hypothesis https://github.com/hypothesis/pdf.js-hypothes.is

karger commented 1 year ago

It might be worth investigating online which of pdf.js and pdfium is considered most robust/able to handle the most pdf weirdness/produces the best html.

All we do is invoke it for conversion, so the coupling to nb is very light---so it would probably be quite easy to switch, though we would need to keep using pdfjs for the legacy documents since we rely on the converted html being the same every time.
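
One way to picture the legacy-document constraint is to record, per document, which converter produced its html and to keep using that one. The sketch below is an assumption-laden illustration, not NB's data model: the `rendering` record, both function names, and the version strings in the usage note are made up.

```javascript
// Sketch only: the `rendering` record and these helpers are hypothetical.

/**
 * Decide which converter should render a document.
 * `doc.rendering` is a record saved when the document was first converted,
 * shaped like { converter: "pdfjs", version: "1.0.0" }.
 */
function chooseConverter(doc, preferred) {
  // Documents that already have a stored rendering keep it, so existing
  // annotations stay anchored to the same html.
  if (doc.rendering) {
    return doc.rendering;
  }
  // New uploads get whichever converter is currently preferred.
  return preferred;
}

/** Cache file name that keeps renderings from different converters apart. */
function cacheFileName(contentHash, rendering) {
  return `${contentHash}.${rendering.converter}-${rendering.version}.html`;
}
```

For example, `cacheFileName("abc", { converter: "pdfium", version: "2.0.0" })` yields `abc.pdfium-2.0.0.html`, so a pdfjs rendering of the same pdf can sit next to it in the cache without being overwritten.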

semisenioritis commented 1 year ago

Thanks a lot!! I found a few more random resources, but the best docs are in the examples on the official page itself. Not a lot to go by, but you can get a brief overview.

semisenioritis commented 1 year ago

Sure, I'll look into comparing both too.