Closed by gedw99 1 year ago
@gedw99 pdfium compiled to wasm does not work currently, whatever compiler you use. Either the compilers need to be extended to support the things that pdfium needs, or those things need to be patched out of pdfium.
So the problem right now is not specifically Wazero or Emscripten.
Got it .. egg on face :)
@ncruces wondering if you had to do some heavy lifting or not to get RethinkRaw working nicely in wasm form. If you happen to have hobby or otherwise interest in PDFium feels like this thread is getting stuck and perhaps you have some advice based on your experiences. I'd hate to see folks end in a cul-de-sac regardless of why.
No, no significant patching necessary.
But that's because `dcraw` is meant to be a command line program with few dependencies besides `libc`, which fits the WASI model perfectly.
For instance, I don't even bother exporting functions, I just call `main` with the correct “command line” arguments.
This is also because `dcraw` is not the best piece of code in the world. Part of the goal was to sandbox its crashes (and prevent them from corrupting your files). So I do want to tear it down and back up every single time.
None of these considerations apply to `pdfium`, which actually I have some familiarity with. Unfortunately I don't have a lot of free time, to be honest.
It's quite clear what needs to happen to pdfium to make it work correctly: replace the allocator by one that does not use ASLR and probably also not virtual memory pages. Besides not knowing a lot of CPP, I also don't really have the time to put into it.
But honestly, the biggest issue right now is that I don't have any idea if it is even going to perform so I'm quite hesitant to put in that time, even if I would have it.
While it doesn't always work, I was surprised last year. How about adding #hacktoberfest topic to this repo and the same to this issue (possibly re-doing the title and description about the allocator change)? If I find someone with CPP background and some excuse I'll also divert them here.
Exciting news! I just discovered that newer pdfium versions have a new build option, `pdf_use_partition_alloc`. If you set it to false, pdfium no longer uses the partition allocator, which prevents a lot of the issues we were having. I was able to compile fine with Emscripten version 3.1.24 and pdfium 5378. This avoids the issues that I was having with the random generator and the broken memory allocation (both caused by the partition allocator).
I'm now able to call:
- `_initialize` / `FPDF_InitLibrary`
- `FPDF_LoadMemDocument`
- `FPDFBitmap_CreateEx` / `FPDFBitmap_Create`

I'm now stuck at loading the page with `FPDF_LoadPage`, which uses some functionality that Wazero does not support yet. I'm trying to figure it out with them. Another problem later on would be the lack of filesystem support in the standalone build of Emscripten, since it does not have a WASI backend yet. If `FPDF_LoadPage` works I'm going to focus on that, since we would need filesystem support to read/write PDF files and also load fonts.
Hey @jerbob92
That's amazing stuff!!
Don't know for sure, but HackPadFS might help with the file system aspects. It's designed for WASM Golang and TinyGo. https://github.com/hack-pad/hackpadfs
@gedw99 Pdfium isn't written in Go, it's compiled to WASM using Emscripten, so that package is not usable. We need Emscripten to implement WASI for WASMFS, so that the file requests actually end up at Wazero so that it can handle them. Wazero has support for that so once Emscripten adds it, it will work.
An Emscripten maintainer told me that there are no current plans to create a WASI backend, and that it's probably best to make one ourselves based on one of the existing backends: https://github.com/emscripten-core/emscripten/tree/main/system/lib/wasmfs/backends
Sadly my C++ isn't that good, so not sure if it's going to work out, but I'll try.
Once you've done `FPDF_LoadMemDocument`, you may avoid having to implement WASI for IO by using `FPDF_LoadCustomDocument` instead.
All it requires is that you define an `FPDF_FILEACCESS` struct. This will need just a bit of C/C++ with a single callback into Go that reads from an `io.ReaderAt`.
That should be much easier than supporting the entire range of WASI syscalls.
That's only for reading PDFs. Not sure if there's anything similar for writing them.
I'm planning to implement all methods, just like in the cgo version. I already have FPDF_LoadCustomDocument implemented in the cgo version, but it might be problematic in the WASM version.
Since Wazero already supports WASI, there isn't much for me to implement, just the translation layer for WasmFS in Emscripten. And we will need that anyway for the fonts.
So after a lot of changes to Emscripten and Wazero to make pdfium usable on non-web environments and with a lot of help from @codefromthecrypt, a fully working example could be made! I can successfully render pages now!
Initial tests, rendering a fairly simple PDF into a 2000x2000 image:

WebAssembly:
- PDF from path (requires WASM host calls): 800ms-850ms
- PDF from binary data: 775ms-825ms

CGO:
- PDF from path: 15ms-20ms
- PDF from binary data: 15ms-20ms
This was measured without the engine initialization and without the `FPDF_InitLibrary` call, so just the loading of the document, the page and the rendering. The CGO version executed the exact same calls as the WebAssembly version.
Wow interesting and amazing work @jerbob92
I wonder if the wasm is faster on multiple runs? Might be a warm up aspect
@gedw99 Tried that for you, secondary renders of the same document are indeed faster (I did close and re-open the file/page/bitmap in the loop), probably it's faster because it doesn't have to load the same fonts again.
For the Webassembly version it takes off about 150ms, so then it becomes 650ms-700ms. For the CGO version it also takes quite some time off, those become like 5-8ms.
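For anyone who wants to repeat the cold versus warm comparison, a small timing harness along these lines may help. It is only a sketch: `renderOnce` is a hypothetical placeholder for "open the document, load the page, render, close everything again", and the sleep merely simulates work.

```go
package main

import (
	"fmt"
	"time"
)

// timeRuns calls op n times and returns the duration of each call, so the
// first (cold) run can be compared with the later (warm) runs.
func timeRuns(n int, op func()) []time.Duration {
	durations := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		op()
		durations = append(durations, time.Since(start))
	}
	return durations
}

// renderOnce is a hypothetical stand-in for: open document, load page,
// render to bitmap, then close the bitmap, page and document again.
func renderOnce() {
	time.Sleep(2 * time.Millisecond)
}

func main() {
	// Engine and library initialization would happen here, outside the timed region.
	runs := timeRuns(5, renderOnce)
	fmt.Println("first render: ", runs[0])
	fmt.Println("later renders:", runs[1:])
}
```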
I would have expected it to be somewhat slower, but not this much, kinda disappointed.
Ok... I feel a bit stupid, I had a thought this morning in the shower and it was true... I was using the Interpreter of Wazero (and not the Compiler), probably because the tracing only works with the Interpreter and I never switched it back. The Compiler has way way way better speed (from binary data, for some reason secondary renders on the same path is broken in the compiler version):
- Initial render: 35ms-50ms
- Later renders of same document: 20ms-25ms
So, still not as good as the CGO version, but already a lot better. We might be able to make some extra improvements to get the speed up but maybe @codefromthecrypt has some ideas on that. Probably not having to call the host for the invokes will already improve things.
Shower thinking always helps :)
I will have a play with it. It still feels too slow compared to CGO.
Sorry to ask, but is there a makefile for this? Maybe a wired-up example too?
I am also curious about multi-threading it in Wazero and the browser.
For the browser it needs to run in web workers, with the main DOM window loading up 4 web workers (the typical number used).
@gedw99 I have just updated https://github.com/jerbob92/go-pdfium-wasm to include a complete example. It also has some patches that should be applied to Wazero/Emscripten to make it work but you shouldn't have to worry about that, a compiled pdfium and patched Wazero is included.
Multi-threading should be quite easy by calling `r.InstantiateModule` with the `compiled` module multiple times.
This will allow you to do multiple operations concurrently. Be aware that pdfium itself is not multi-threaded though, so you can't do multiple operations on the same instance at the same time.
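One way to respect that constraint is a small pool in which every goroutine borrows its own instance. The sketch below uses only the standard library: `instance` is a hypothetical stand-in for one instantiated module, and `render` is a placeholder for the real pdfium calls, so none of these names come from Wazero or go-pdfium.

```go
package main

import (
	"fmt"
	"sync"
)

// instance is a hypothetical stand-in for one instantiated pdfium module.
// Like a single pdfium instance, it must not be used concurrently.
type instance struct{ id int }

// render is a placeholder for the work done inside one instance.
func (in *instance) render(page int) string {
	return fmt.Sprintf("instance %d rendered page %d", in.id, page)
}

// pool hands out instances so that each is used by one goroutine at a time.
type pool struct{ free chan *instance }

func newPool(n int) *pool {
	p := &pool{free: make(chan *instance, n)}
	for i := 0; i < n; i++ {
		// In real code this is where one module instantiation would happen.
		p.free <- &instance{id: i}
	}
	return p
}

// with borrows an instance, runs fn, and returns the instance to the pool.
func (p *pool) with(fn func(*instance)) {
	in := <-p.free
	defer func() { p.free <- in }()
	fn(in)
}

// renderAll renders pages concurrently without ever sharing an instance.
func renderAll(p *pool, pages int) []string {
	results := make([]string, pages)
	var wg sync.WaitGroup
	for page := 0; page < pages; page++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done()
			p.with(func(in *instance) { results[page] = in.render(page) })
		}(page)
	}
	wg.Wait()
	return results
}

func main() {
	for _, line := range renderAll(newPool(4), 8) {
		fmt.Println(line)
	}
}
```

The buffered channel doubles as a semaphore: at most as many operations run at once as there are instances, and the rest wait for one to be returned.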
Thank you @jerbob92
will see how I go with it and let you know
Hey @jerbob92
you might like this
https://github.com/jerbob92/go-pdfium-wasm/issues/2
Been using it to make dev more streamlined
I have started on the webassembly implementation here: https://github.com/klippa-app/go-pdfium/pull/64 I probably need to do some more work in Emscripten & Wazero to get file writing to work, but apart from that everything should work. It's going to take quite some time though to implement everything.
The functionality in #64 has been completed, all the methods that work in the CGO implementation now also work in the WebAssembly implementation. The release will be on Friday after Wazero has released their v1.0.0.
The latest benchmarks indicate that it's about 2x as slow as native, but here's the thing: it's also 2x as fast as the CGO multi-threaded go-plugin implementation, depending on what kind of operations you are doing. We are doing a lot of image rendering, which means a lot of data going back and forth that has to be encoded and decoded over gRPC, which takes a lot of time. The file data itself also goes over gRPC, whereas with the WebAssembly version you can load from a file path or Go reader directly and have it seek over the file, which is much more efficient than loading the complete file.
So, all in all, it's a pretty good competitor for the CGO implementation. Compared to the single-threaded direct CGO version, it adds sandboxing and won't segfault your program in case of CGO errors. Compared to the multi-threaded CGO implementation, it's a super-good competitor because it can do all the things that go-plugin couldn't (methods that require callbacks, like form filling, reading from a seekable reader, writing to a Go writer), and it has sandboxing while still being about twice as fast.
That’s an amazing effort and I really appreciate the professional summary.
will definitely be using this on the Open Science project …
May I suggest a PR to add go-pdfium to: https://wazero.io/community/users/
@ncruces Yes, was going to do that after release :)
This has been released in v1.4.0 :partying_face:
https://github.com/bblanchon/pdfium-binaries has a WebAssembly version.
Golang is very capable of running WebAssembly. For example, Wazero can run wasm with no cgo.
Why?
- One pdfium for all targets (web, desktop, server, etc.)
- No cgo.
- Easy to debug using Chrome: https://blog.noops.land/debugging-webAssembly-from-go-sources-in-chrome-devtools

Anyone interested in exploring this architecture?