klippa-app / go-pdfium

Easy to use PDF library using Go and PDFium
MIT License

Web assembly #60

Closed gedw99 closed 1 year ago

gedw99 commented 2 years ago

https://github.com/bblanchon/pdfium-binaries has a WebAssembly version.

Go is very capable of running WebAssembly. For example, Wazero can run wasm with no cgo.

Why?

One pdfium for all targets (web, desktop, server, etc.). No cgo. Easy to debug using Chrome: https://blog.noops.land/debugging-webAssembly-from-go-sources-in-chrome-devtools

Anyone interested in exploring this architecture?

jerbob92 commented 2 years ago

@gedw99 pdfium compiled to wasm does not work currently, whatever compiler you use. Either the compilers need to be extended to support the things that pdfium needs, or those things need to be patched out of pdfium.

So the problem right now is not specifically Wazero or Emscripten.

gedw99 commented 2 years ago

Got it .. egg on face :)

codefromthecrypt commented 2 years ago

@ncruces wondering if you had to do some heavy lifting or not to get RethinkRAW working nicely in wasm form. If you happen to have a hobby or other interest in PDFium: it feels like this thread is getting stuck, and perhaps you have some advice based on your experience. I'd hate to see folks end up in a cul-de-sac, regardless of why.

ncruces commented 2 years ago

No, no significant patching necessary.

But that's because dcraw is meant to be a command line program with few dependencies besides libc, which fits the WASI model perfectly.

For instance, I don't even bother exporting functions, I just call main with the correct “command line” arguments.

This is also because dcraw is not the best piece of code in the world. Part of the goal was to sandbox its crashes (and prevent them from corrupting your files). So I do want to tear it down and back up every single time.

None of these considerations apply to pdfium, which actually I have some familiarity with. Unfortunately I don't have a lot of free time, to be honest.

jerbob92 commented 2 years ago

It's quite clear what needs to happen to pdfium to make it work correctly: replace the allocator with one that does not use ASLR and probably also does not use virtual memory pages. Besides not knowing a lot of CPP, I also don't really have the time to put into it.

But honestly, the biggest issue right now is that I have no idea whether it is even going to perform well, so I'm quite hesitant to put in that time, even if I had it.

codefromthecrypt commented 2 years ago

While it doesn't always work, I was surprised last year. How about adding #hacktoberfest topic to this repo and the same to this issue (possibly re-doing the title and description about the allocator change)? If I find someone with CPP background and some excuse I'll also divert them here.

jerbob92 commented 2 years ago

Exciting news! I just discovered that newer pdfium versions have a new build option, pdf_use_partition_alloc. If you set it to false, pdfium no longer uses the partition allocator, which prevents a lot of the issues we were having. I was able to compile fine with Emscripten version 3.1.24 and pdfium 5378. This prevents the issues that I was having with the random generator and the broken memory allocation (both caused by the partition allocator).

I'm now able to:

I'm now stuck at loading the page with FPDF_LoadPage, it uses some functionality that Wazero does not support yet. I'm trying to figure it out with them. Another problem later on would be the lack of filesystem support in the standalone build of Emscripten, since it does not have a WASI backend yet. If the FPDF_LoadPage works I'm going to focus on that, since we would need filesystem support to read/write PDF files and also load fonts.

gedw99 commented 2 years ago

hey @jerbob92

That's amazing stuff!!

Don't know for sure, but HackPadFS might help with the file system aspects. It's designed for WASM with Go and TinyGo. https://github.com/hack-pad/hackpadfs

jerbob92 commented 2 years ago

@gedw99 Pdfium isn't written in Go, it's compiled to WASM using Emscripten, so that package is not usable. We need Emscripten to implement WASI for WASMFS, so that the file requests actually end up at Wazero so that it can handle them. Wazero has support for that so once Emscripten adds it, it will work.

jerbob92 commented 2 years ago

An Emscripten maintainer told me that there are no current plans to create a WASI backend, and that it's probably best to make one ourselves based on one of the existing backends: https://github.com/emscripten-core/emscripten/tree/main/system/lib/wasmfs/backends

Sadly my C++ isn't that good, so not sure if it's going to work out, but I'll try.

ncruces commented 2 years ago

Once you've done FPDF_LoadMemDocument, you may avoid having to implement WASI for IO by using FPDF_LoadCustomDocument instead.

All it requires is that you define an FPDF_FILEACCESS struct. This will need just a bit of C/C++ with a single callback into Go that reads from an io.ReaderAt.
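A minimal sketch of the Go side of such a callback, assuming the usual m_GetBlock contract from FPDF_FILEACCESS (fill the buffer with exactly the requested bytes starting at the given position, return non-zero on success and zero on error). The names memFile and getBlock are hypothetical and only for illustration, and how the bytes get copied into the WebAssembly module's memory is left out:

```go
package main

import (
	"fmt"
	"io"
)

// memFile is a hypothetical in-memory PDF source used only for this sketch.
// Any io.ReaderAt (for example an *os.File) would work the same way.
type memFile []byte

// ReadAt implements io.ReaderAt over the byte slice.
func (m memFile) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(m)) {
		return 0, io.EOF
	}
	n := copy(p, m[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// getBlock mirrors the assumed m_GetBlock contract: fill buf completely with
// the bytes starting at position, returning 1 on success and 0 on failure.
func getBlock(r io.ReaderAt, position uint64, buf []byte) int {
	// io.ReaderAt may return io.EOF together with a full read, so only the
	// byte count decides whether the block was read successfully.
	n, _ := r.ReadAt(buf, int64(position))
	if n != len(buf) {
		return 0
	}
	return 1
}

func main() {
	f := memFile("%PDF-1.7 example bytes")
	buf := make([]byte, 4)
	fmt.Println(getBlock(f, 0, buf), string(buf)) // prints "1 %PDF"
}
```

Because only the requested ranges are read, the document never has to be loaded into memory as a whole.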

That should be much easier than supporting the entire range of WASI syscalls.

That's only for reading PDFs. Not sure if there's anything similar for writing them.

jerbob92 commented 2 years ago

I'm planning to implement all methods, just like in the cgo version. I already have FPDF_LoadCustomDocument implemented in the cgo version, but it might be problematic in the WASM version.

Since Wazero already supports WASI, there isn't much for me to implement, just the translation layer for WasmFS in Emscripten. And we will need that anyway for the fonts.

jerbob92 commented 1 year ago

So after a lot of changes to Emscripten and Wazero to make pdfium usable on non-web environments and with a lot of help from @codefromthecrypt, a fully working example could be made! I can successfully render pages now!

Initial tests, rendering a fairly simple PDF into a 2000x2000 image:

WebAssembly:
PDF from path (requires WASM host calls): 800ms - 850ms
PDF from binary data: 775ms - 825ms

CGO:
PDF from path: 15-20ms
PDF from binary data: 15-20ms

This was measured without the engine initialization and without the FPDF_InitLibrary call, so it covers just the loading of the document, the loading of the page, and the rendering. The CGO version executed the exact same calls as the WebAssembly version.

gedw99 commented 1 year ago

Wow interesting and amazing work @jerbob92

I wonder if the wasm is faster on multiple runs? Might be a warm up aspect

jerbob92 commented 1 year ago

@gedw99 Tried that for you, secondary renders of the same document are indeed faster (I did close and re-open the file/page/bitmap in the loop), probably it's faster because it doesn't have to load the same fonts again.

For the WebAssembly version it takes off about 150ms, so then it becomes 650ms-700ms. For the CGO version it also takes quite some time off; those become about 5-8ms.
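A comparison of first versus repeated renders like this one can be reproduced with a small timing harness. The sketch below is only an illustration: timeRuns is a hypothetical helper, and the work function is a stand-in that sleeps instead of opening a document, loading a page, rendering, and closing everything again.

```go
package main

import (
	"fmt"
	"time"
)

// timeRuns calls work n times and returns the duration of the first call and
// the mean duration of the remaining calls.
func timeRuns(n int, work func()) (first, laterMean time.Duration) {
	var later time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		work()
		d := time.Since(start)
		if i == 0 {
			first = d
		} else {
			later += d
		}
	}
	if n > 1 {
		laterMean = later / time.Duration(n-1)
	}
	return first, laterMean
}

func main() {
	warmed := false
	// Hypothetical stand-in for "open document, load page, render, close".
	work := func() {
		if !warmed {
			time.Sleep(20 * time.Millisecond) // pretend one-time font loading
			warmed = true
		}
		time.Sleep(5 * time.Millisecond) // pretend per-render cost
	}
	first, later := timeRuns(5, work)
	fmt.Printf("first: %v, later (mean): %v\n", first, later)
}
```

Reporting the first run separately keeps one-time costs, such as font loading, from skewing the steady-state number.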

I would have expected it to be somewhat slower, but not this much, kinda disappointed.

jerbob92 commented 1 year ago

Ok... I feel a bit stupid. I had a thought this morning in the shower and it was true: I was using the Interpreter of Wazero (and not the Compiler), probably because the tracing only works with the Interpreter and I never switched it back. The Compiler is way, way, way faster (measured from binary data; for some reason secondary renders from the same path are broken in the Compiler version):

Initial render: 35ms-50ms
Later renders of same document: 20ms-25ms

So, still not as good as the CGO version, but already a lot better. We might be able to make some extra improvements to get the speed up but maybe @codefromthecrypt has some ideas on that. Probably not having to call the host for the invokes will already improve things.

gedw99 commented 1 year ago

Shower thinking always helps :)

I will have a play with it. It still feels too slow compared to CGO.

gedw99 commented 1 year ago

Sorry to ask, but is there a makefile for this? Maybe a wired-up example too?

I am also curious about multi-threading it in Wazero and the browser.

For the browser it needs to be a web worker, with the main DOM window loading up 4 web workers (the typical number used).

jerbob92 commented 1 year ago

@gedw99 I have just updated https://github.com/jerbob92/go-pdfium-wasm to include a complete example. It also has some patches that should be applied to Wazero/Emscripten to make it work but you shouldn't have to worry about that, a compiled pdfium and patched Wazero is included.

Multi-threading should be quite easy by calling r.InstantiateModule with the compiled module multiple times.

This will allow you to do multiple operations concurrently. Be aware that pdfium itself is not multi-threaded though, so you can't do multiple operations on the same instance at the same time.
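One way to respect that constraint is a small pool that lends each instance to one goroutine at a time. This is a minimal sketch and not the go-pdfium implementation: the instance type is a hypothetical placeholder for one instantiated module, and render stands in for real pdfium calls.

```go
package main

import (
	"fmt"
	"sync"
)

// instance stands in for one instantiated pdfium module (hypothetical type).
type instance struct{ id int }

// render is a placeholder for work done on a single instance.
func (in *instance) render(page int) string {
	return fmt.Sprintf("instance %d rendered page %d", in.id, page)
}

// pool hands out instances so that each one is used by only one goroutine
// at a time.
type pool struct{ free chan *instance }

// newPool creates n instances and marks all of them as available.
func newPool(n int) *pool {
	p := &pool{free: make(chan *instance, n)}
	for i := 0; i < n; i++ {
		p.free <- &instance{id: i}
	}
	return p
}

// with borrows an instance, runs fn on it, and returns it to the pool.
// It blocks while all instances are in use.
func (p *pool) with(fn func(*instance)) {
	in := <-p.free
	defer func() { p.free <- in }()
	fn(in)
}

func main() {
	p := newPool(4)
	var wg sync.WaitGroup
	results := make([]string, 8)
	for page := range results {
		wg.Add(1)
		go func(page int) {
			defer wg.Done()
			p.with(func(in *instance) { results[page] = in.render(page) })
		}(page)
	}
	wg.Wait()
	fmt.Println(len(results), "pages rendered") // prints "8 pages rendered"
}
```

The buffered channel acts as both the free list and the lock: a goroutine that cannot receive an instance simply waits until another one hands its instance back.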

gedw99 commented 1 year ago

Thank you @jerbob92

will see how I go with it and let you know

gedw99 commented 1 year ago

Hey @jerbob92

you might like this

https://github.com/jerbob92/go-pdfium-wasm/issues/2

Been using it to make dev more streamlined

jerbob92 commented 1 year ago

I have started on the webassembly implementation here: https://github.com/klippa-app/go-pdfium/pull/64 I probably need to do some more work in Emscripten & Wazero to get file writing to work, but apart from that everything should work. It's going to take quite some time though to implement everything.

jerbob92 commented 1 year ago

The functionality in #64 has been completed, all the methods that work in the CGO implementation now also work in the WebAssembly implementation. The release will be on Friday after Wazero has released their v1.0.0.

The latest benchmarks indicate that it's about 2x as slow as native, but here's the thing: it's also 2x as fast as the CGO multi-threaded go-plugin implementation, depending on what kind of operations you are doing. We are doing a lot of image rendering, which means a lot of data going back and forth that has to be encoded and decoded over gRPC, which takes a lot of time. The file data itself also goes over gRPC; with the WebAssembly version you can load from a file path or Go reader directly and have it seek over the file, which is much more efficient than loading in the complete file.

So, all in all, it's a pretty good competitor for the CGO implementation. Compared to the single-threaded direct CGO version, it has sandboxing and won't segfault your program in case of CGO errors. Compared to the multi-threaded CGO implementation, it's a super-good competitor because it can do all the things that go-plugin couldn't (methods that require callbacks, like form filling, reading from a seekable reader, writing to a Go writer), and it has sandboxing while still being about twice as fast.

gedw99 commented 1 year ago

That’s an amazing effort and I really appreciate the professional summary.

will definitely be using this on the Open Science project …

ncruces commented 1 year ago

May I suggest a PR to add go-pdfium to: https://wazero.io/community/users/

jerbob92 commented 1 year ago

@ncruces Yes, was going to do that after release :)

jerbob92 commented 1 year ago

This has been released in v1.4.0 :partying_face: