klippa-app / go-pdfium

Easy to use PDF library using Go and PDFium
MIT License

Web assembly #60

Closed gedw99 closed 1 year ago

gedw99 commented 2 years ago

https://github.com/bblanchon/pdfium-binaries has a WebAssembly version.

Golang is very capable of running WebAssembly. For example, Wazero can run wasm with no cgo.

Why?

One pdfium for all targets (web, desktop, server, etc.). No cgo. Easy to debug using Chrome: https://blog.noops.land/debugging-webAssembly-from-go-sources-in-chrome-devtools

Anyone interested in exploring this architecture ?

jerbob92 commented 2 years ago

That does sound interesting! Will do some tests soon.

jerbob92 commented 2 years ago

@gedw99 do you perhaps have an idea how to call the pdfium WASM from the go-pdfium WASM binary?

jerbob92 commented 2 years ago

I tried to build the example Go as WASM, but it has some weird behavior (it doesn't compile)

gedw99 commented 2 years ago

Sorry for the lack of response. Came down with some flu.

will try this when back on my feet.

Wazero should in theory be able to be used to host the wasm. Wazero can be embedded with any golang program. They have many examples and there are many on GitHub.

The way the wasm was compiled is important, however. Was it compiled to run in a browser or outside a browser? Also, I know Wazero has or is working on being able to run both.

You can also ask the Wazero team - they are really proactive

gedw99 commented 2 years ago

Line 12 looks like emscripten. https://github.com/bblanchon/pdfium-binaries/blob/master/steps/06-build.sh

jerbob92 commented 2 years ago

I understand, but when pdfium is compiled for WASM, and go-pdfium is compiled in WASM, that doesn't mean that the go-pdfium WASM can interface with the pdfium WASM, so that's the problem I'm trying to figure out right now.

gedw99 commented 2 years ago

So it sounds like you're trying to run both in a browser? The answer to this I don’t know. I would ask the Wazero team. You raise a very good point and I have not delved into this.

for server, desktop and mobile though here is an example that calls into code compiled to wasm with emscripten: https://github.com/tetratelabs/wazero/blob/866fac2e969c1d45ce2459355de88a6395202aae/emscripten/emscripten_example_test.go

jerbob92 commented 2 years ago

I'm not trying to run anything yet, I'm just trying to figure out how this would work in theory.

pdfium compiled into WASM isn't how Go normally integrates with libraries, because that happens through cgo. In this case Go would need to know that it has to find the cgo implementations in the separate pdfium WASM binary. I'm just seeing how that would work.

gedw99 commented 2 years ago

Everything you say I agree with.

basically Wazero runs the wasm compiled pdfium.

That’s why I added the link to the example. I would start there.

It’s not complicated . In fact less complicated than cgo imho. Give it a try !! The Wazero team will help if they can. Just make a reproduction repo for them ( or branch ).

I am still sick in bed mate so can’t give it a try myself .

gedw99 commented 2 years ago

It does NOT need the CGO stuff, to answer your question.

gedw99 commented 2 years ago

You're able to call functions in the wasm from golang by using Wazero.

jerbob92 commented 2 years ago

I understand what you're saying, but that means that I would have to rewrite every pdfium C++ implementation that I have right now to work with Wazero/pdfium WASM, it's not going to happen.

If we can compile go-pdfium into WASM, and it could automatically call into the pdfium WASM, as if the pdfium WASM is the C++ library that go-pdfium would normally call into, then that would be perfect.

gedw99 commented 2 years ago

True, you will have to rewrite the calls. There is no cgo involved. You're bypassing cgo essentially.

It all comes down to if it’s worth it for you .

Speed and throughput might be lower or higher. I would try one or two functions first and do a benchmark comparison using golang tests.
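A minimal sketch of such a comparison, assuming both backends can be hidden behind one small interface. The `Backend` interface and the `stubBackend` type are hypothetical placeholders (neither calls PDFium); only `testing.Benchmark` is a real standard-library call. Swapping the stubs for a cgo-backed and a wasm-backed implementation would give the numbers being asked for.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// Backend is a hypothetical seam between a cgo implementation and a wasm one.
type Backend interface {
	Name() string
	CountObjects(pdf []byte) int
}

// stubBackend does placeholder work so the harness runs without PDFium.
type stubBackend struct{ name string }

func (s stubBackend) Name() string { return s.name }

func (s stubBackend) CountObjects(pdf []byte) int {
	return bytes.Count(pdf, []byte(" obj"))
}

// benchmarkBackend times repeated calls to one backend on the same input.
func benchmarkBackend(b Backend, pdf []byte) testing.BenchmarkResult {
	return testing.Benchmark(func(tb *testing.B) {
		tb.ReportAllocs()
		for i := 0; i < tb.N; i++ {
			b.CountObjects(pdf)
		}
	})
}

func main() {
	// Placeholder input, not a valid PDF document.
	pdf := bytes.Repeat([]byte("1 0 obj <<>> endobj\n"), 1000)
	for _, b := range []Backend{stubBackend{"cgo"}, stubBackend{"wasm"}} {
		res := benchmarkBackend(b, pdf)
		fmt.Printf("%-5s %s %s\n", b.Name(), res, res.MemString())
	}
}
```

Running both backends on identical input inside one process keeps the comparison fair; real measurements would also want realistic documents rather than the repeated placeholder line.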

jerbob92 commented 2 years ago

I don't really see any advantage of that right now, but if you're willing to give it a try, that's fine with me. If we can have a solution that supports both cgo and WASM, that would be nice.

The only advantage of a go-pdfium WASM version for me would be being able to run it in a place where you can't run the pdfium C++ library directly, so the browser for example.

gedw99 commented 2 years ago

I don’t have time . But I expect someone else will. Maybe tag this issue with something appropriate

gedw99 commented 2 years ago

All wasm runtimes are using the wit format as a DSL. A simplified way to think about it is that it replaces CGO etc.

https://github.com/theduke/wasi-sql/blob/main/schema/sql_v1_alpha1.wit

see : https://github.com/tetratelabs/wazero/issues/662

it will code gen the DSL for you

jerbob92 commented 2 years ago

Yeah, but I don't see the added value of that right now, why is that better than CGO? Is it faster? Easier to implement? Easier to deploy? Does it give more flexibility in the deployment?

gedw99 commented 2 years ago

It should be much faster: no cgo, and wasm via em++ produces lean wasm.

Deployment is easier because the runtime linking mess on desktop and mobile is 100% bypassed. You're just loading wasm that you embedded using normal golang embedding.

Easier to manage and implement because one version runs everywhere. Clearly way easier to maintain imho.

gedw99 commented 2 years ago

Your threading use case should also be easier

and you gain security sandboxing in the cloud. Docker is not secure in this sense.

jerbob92 commented 2 years ago

Sounds good! Looking forward to your benchmarks!

gedw99 commented 2 years ago

I don’t have time to work on this.

jerbob92 commented 2 years ago

Me neither :laughing:

codefromthecrypt commented 2 years ago

Interesting thread. FYI, we've opened a gophers slack wazero channel for chatter as you need it. Also, we notice a lot of people struggle with wasm in general (including ourselves 😊) so started adding notes pages which may help a bit. https://wazero.io/languages/

codefromthecrypt commented 2 years ago

PS: with no CGO there's also a cool devops win, in that you don't need to care about the OS or install shared libraries etc. https://gist.github.com/codefromthecrypt/edb33284354d592dc6056b9b7263872f

jerbob92 commented 2 years ago

@codefromthecrypt would it be possible to generate Go code that calls into the WASM automatically? So that you have an actual Go interface like in CGO, and not the ExportedFunction?

codefromthecrypt commented 2 years ago

@jerbob92 I think that's what @knqyf263 is trying to do with https://github.com/knqyf263/go-plugin

jerbob92 commented 2 years ago

@codefromthecrypt I think that's rather for Go programs compiled into WASM. In this case we're trying to call into the prebuilt PDFium WASM.

codefromthecrypt commented 2 years ago

right I guess most common would be TinyGo. It can import functions from other wasm, as well as export its own https://wazero.io/languages/tinygo/

codefromthecrypt commented 2 years ago

and sorry for the possibly misaimed advice, but this project is also using pre-built wasm, which wasn't built specifically for go https://github.com/ncruces/RethinkRAW

jerbob92 commented 2 years ago

Thanks! I'll look some more into it. It seems like RethinkRAW just executes the WASM as a runnable binary.

codefromthecrypt commented 2 years ago

I did some poking and I think first thing could be to check the viability of pdfium publishing another Dist which isn't a JS one, rather a standalone (assuming that's possible). That reduces the implementation surface and can also help clarify the exports as well. In any case the WebAssembly function exports from pdfium (which are done with emscripten), can be called either by host code (ExportedFunction) or something that compiles to wasm like tinygo or zig. Basically, you are using pdfium as a lib, but you can also have another %.wasm import its functions such as FPDF_LoadPage or whatever. I'm not experienced enough to figure out if this is viable or not, but hope the breadcrumbs help!


So, the first thing I noticed is this isn't built for -s STANDALONE_WASM. It is using a lot more imports from emscripten and whatnot. If possible, it would be nice for https://github.com/bblanchon/pdfium-binaries to make a standalone version, as otherwise there will be a lot of whack-a-mole, and that time spent might be a lot more than the former.

Ex. Only a couple imports below are built into wazero, like wasi_snapshot_preview1. Some of the others are javascript mappings.

$ wasm2wat ~/Downloads/pdfium-wasm/lib/pdfium.wasm |grep '(import'
  (import "env" "abort" (func (;0;) (type 14)))
  (import "env" "emscripten_resize_heap" (func (;1;) (type 1)))
  (import "env" "emscripten_memcpy_big" (func (;2;) (type 4)))
  (import "env" "__sys_mmap2" (func (;3;) (type 10)))
  (import "env" "__sys_munmap" (func (;4;) (type 3)))
  (import "env" "__sys_mprotect" (func (;5;) (type 4)))
  (import "env" "__sys_madvise1" (func (;6;) (type 4)))
  (import "env" "__sys_getpid" (func (;7;) (type 13)))
  (import "env" "gettimeofday" (func (;8;) (type 3)))
  (import "wasi_snapshot_preview1" "environ_sizes_get" (func (;9;) (type 3)))
  (import "wasi_snapshot_preview1" "environ_get" (func (;10;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_close" (func (;11;) (type 1)))
  (import "wasi_snapshot_preview1" "fd_write" (func (;12;) (type 6)))
  (import "wasi_snapshot_preview1" "fd_fdstat_get" (func (;13;) (type 3)))
  (import "env" "__cxa_atexit" (func (;14;) (type 4)))
  (import "env" "strftime_l" (func (;15;) (type 8)))
  (import "env" "__sys_open" (func (;16;) (type 4)))
  (import "env" "__sys_fcntl64" (func (;17;) (type 4)))
  (import "env" "__sys_ioctl" (func (;18;) (type 4)))
  (import "wasi_snapshot_preview1" "fd_read" (func (;19;) (type 6)))
  (import "env" "__sys_fstat64" (func (;20;) (type 3)))
  (import "env" "__sys_stat64" (func (;21;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_sync" (func (;22;) (type 1)))
  (import "env" "__sys_ftruncate64" (func (;23;) (type 6)))
  (import "env" "time" (func (;24;) (type 1)))
  (import "env" "__localtime_r" (func (;25;) (type 3)))
  (import "env" "__sys_getdents64" (func (;26;) (type 4)))
  (import "env" "setTempRet0" (func (;27;) (type 0)))
  (import "env" "_emscripten_throw_longjmp" (func (;28;) (type 14)))
  (import "env" "invoke_viiii" (func (;29;) (type 9)))
  (import "env" "getTempRet0" (func (;30;) (type 13)))
  (import "env" "invoke_iii" (func (;31;) (type 4)))
  (import "env" "invoke_iiiii" (func (;32;) (type 8)))
  (import "env" "invoke_v" (func (;33;) (type 0)))
  (import "env" "invoke_iiii" (func (;34;) (type 6)))
  (import "env" "__sys_unlink" (func (;35;) (type 1)))
  (import "env" "__sys_rmdir" (func (;36;) (type 1)))
  (import "env" "__gmtime_r" (func (;37;) (type 3)))
  (import "env" "invoke_viii" (func (;38;) (type 7)))
  (import "env" "invoke_vi" (func (;39;) (type 2)))
  (import "env" "invoke_ii" (func (;40;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_seek" (func (;41;) (type 8)))
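The same listing can be produced from Go without wasm2wat by decoding the import section (section id 2) of the binary directly. This is a minimal sketch under stated assumptions: it only understands the classic import kinds (function, table, memory, global), it reads single-byte fields through the LEB128 reader because their values are below 0x80, and the `sample` module is hand-built for illustration rather than taken from pdfium.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Import is one entry of a module's import section.
type Import struct{ Module, Name string }

// decoder reads LEB128 values and remembers the first error it hits.
type decoder struct {
	r   *bytes.Reader
	err error
}

func (d *decoder) uvarint() uint64 {
	if d.err != nil {
		return 0
	}
	var v uint64
	v, d.err = binary.ReadUvarint(d.r)
	return v
}

// name reads a length-prefixed string.
func (d *decoder) name() string {
	n := d.uvarint()
	if d.err != nil || n > uint64(d.r.Len()) {
		d.err = errors.New("truncated name")
		return ""
	}
	buf := make([]byte, n)
	d.r.Read(buf)
	return string(buf)
}

// limits skips a limits record: flags, minimum, optional maximum.
func (d *decoder) limits() {
	flags := d.uvarint()
	d.uvarint()
	if flags&1 != 0 {
		d.uvarint()
	}
}

// ListImports returns the imports declared by a wasm binary.
func ListImports(wasm []byte) ([]Import, error) {
	if len(wasm) < 8 || !bytes.HasPrefix(wasm, []byte("\x00asm")) {
		return nil, errors.New("not a wasm binary")
	}
	d := &decoder{r: bytes.NewReader(wasm[8:])}
	for d.err == nil && d.r.Len() > 0 {
		id, size := d.uvarint(), d.uvarint()
		if id != 2 {
			d.r.Seek(int64(size), io.SeekCurrent) // skip other sections
			continue
		}
		var out []Import
		for n := d.uvarint(); n > 0 && d.err == nil; n-- {
			im := Import{Module: d.name(), Name: d.name()}
			switch kind := d.uvarint(); kind {
			case 0: // function: type index
				d.uvarint()
			case 1: // table: element type, then limits
				d.uvarint()
				d.limits()
			case 2: // memory: limits
				d.limits()
			case 3: // global: value type, mutability
				d.uvarint()
				d.uvarint()
			default:
				d.err = fmt.Errorf("unknown import kind %d", kind)
			}
			out = append(out, im)
		}
		if d.err != nil {
			return nil, d.err
		}
		return out, nil
	}
	return nil, d.err
}

// sample is a hand-built module: one () -> () type plus two function
// imports. The type index of each import is not validated by this sketch.
var sample = []byte("\x00asm\x01\x00\x00\x00" +
	"\x01\x04\x01\x60\x00\x00" +
	"\x02\x2f\x02\x03env\x05abort\x00\x00" +
	"\x16wasi_snapshot_preview1\x08fd_close\x00\x00")

func main() {
	imports, err := ListImports(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, im := range imports {
		fmt.Printf("%s.%s (WASI: %v)\n", im.Module, im.Name, im.Module == "wasi_snapshot_preview1")
	}
}
```

Grouping the result by module name gives a quick count of how many imports fall outside `wasi_snapshot_preview1` and would therefore need a host-side implementation or a different build.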

Ex. a trivial standalone wasm built by emscripten in wazero: https://github.com/tetratelabs/wazero/tree/main/emscripten/testdata It doesn't have a lot of imports, both because of the code and because it isn't targeting web/js.

https://github.com/tetratelabs/wazero/blob/main/Makefile#L79-L104

jerbob92 commented 2 years ago

@codefromthecrypt They are working on a standalone build (see https://github.com/bblanchon/pdfium-binaries/actions/runs/3081803559). The imports look like this now:

  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (import "wasi_snapshot_preview1" "fd_fdstat_get" (func (;1;) (type 3)))
  (import "env" "emscripten_notify_memory_growth" (func (;2;) (type 0)))
  (import "wasi_snapshot_preview1" "clock_time_get" (func (;3;) (type 24)))
  (import "wasi_snapshot_preview1" "fd_close" (func (;4;) (type 1)))
  (import "wasi_snapshot_preview1" "fd_write" (func (;5;) (type 6)))
  (import "wasi_snapshot_preview1" "fd_seek" (func (;6;) (type 80)))
  (import "wasi_snapshot_preview1" "environ_sizes_get" (func (;7;) (type 3)))
  (import "wasi_snapshot_preview1" "environ_get" (func (;8;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_read" (func (;9;) (type 6)))
  (import "env" "__sys_mprotect" (func (;10;) (type 4)))
  (import "env" "__sys_madvise1" (func (;11;) (type 4)))
  (import "env" "__sys_fstat64" (func (;12;) (type 3)))
  (import "env" "__sys_stat64" (func (;13;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_sync" (func (;14;) (type 1)))
  (import "env" "__sys_ftruncate64" (func (;15;) (type 6)))
  (import "env" "__sys_getpid" (func (;16;) (type 12)))
  (import "env" "__sys_getdents64" (func (;17;) (type 4)))
  (import "env" "setTempRet0" (func (;18;) (type 0)))
  (import "env" "_emscripten_throw_longjmp" (func (;19;) (type 14)))
  (import "env" "invoke_viiii" (func (;20;) (type 9)))
  (import "env" "getTempRet0" (func (;21;) (type 12)))
  (import "env" "invoke_iii" (func (;22;) (type 4)))
  (import "env" "invoke_iiiii" (func (;23;) (type 8)))
  (import "env" "invoke_v" (func (;24;) (type 0)))
  (import "env" "invoke_iiii" (func (;25;) (type 6)))
  (import "env" "__sys_unlink" (func (;26;) (type 1)))
  (import "env" "__sys_rmdir" (func (;27;) (type 1)))
  (import "env" "invoke_viii" (func (;28;) (type 7)))
  (import "env" "invoke_vi" (func (;29;) (type 2)))
  (import "env" "invoke_ii" (func (;30;) (type 3)))

Does that mean that we have to implement these imports ourselves (except for wasi_snapshot_preview1)?

jerbob92 commented 2 years ago

@codefromthecrypt I have implemented dummy methods and that seems to work. However, I'm missing the allocator, which is provided in the JS file of the WASM build of pdfium, so it's quite hard to do memory management in Wazero. Do you perhaps have a default implementation for allocate/deallocate?

jerbob92 commented 2 years ago

Ah, they are called malloc and free. It looks like I'm getting somewhere, nice!

Edit: So far it's doing a lot when I call FPDF_InitLibrary/FPDF_LoadMemDocument, however, eventually it's running into wasm error: unreachable. So I'm not there yet. Hard to debug.

codefromthecrypt commented 2 years ago

sure is hard to debug. you can try https://pkg.go.dev/github.com/tetratelabs/wazero/experimental/logging but it might be too much volume

you can also open a thread on #wazero https://gophers.slack.com/ if more convenient than ping/pong here

I feel like the build could probably be customized to use emscripten's allocator. it is inefficient to use the host for this. here's an example of the allocation we use in tinygo which could be ported to be host exports from wazero. the same general approach is needed https://github.com/tetratelabs/tinymem/blob/main/exports.go

I suppose there's two ways which is to try to make the emscripten imports work host side, or to try to customize the build so that they aren't needed. I feel like the latter will be more sustainable.

Regardless, if you can open a playground repo with progress so far, I can help on the whack-a-mole.

gedw99 commented 2 years ago

thanks @codefromthecrypt for all the useful info. Really useful to learn all these aspects.

jerbob92 commented 2 years ago

Status update:

gedw99 commented 2 years ago

thanks @jerbob92 for the really detailed status update.

I have used this: https://github.com/hack-pad/hackpadfs It's a golang file system API that supports WASM, normal and S3.

When used with WASM, it creates the file system using the browser's IndexedDB. The same API can be used, as I said, for S3 and a normal fs, like if you were using it on a desktop or server. I guess the best name for that is Native :)

But I figured it's worth mentioning HackPadFS, as it might help with the FS blocker.

Also, the developer has built a Golang compiler and editor in Golang that compiles to WASM and runs in a browser. It's nuts but interesting. It does work, but of course is rather slow. I have not tried it with TinyGo compilation. Quite a good acid test: a golang compiler that compiles golang inside a browser, all built with golang. Turtles all the way down, as they say.

jerbob92 commented 2 years ago

@gedw99 thanks for the link! However, I don't think that's usable in this case. The problem here is that Emscripten is trying to deal with the filesystem itself, so it's not allowing any outside implementation of the filesystem when compiled programs use open() or other fs related calls.

I'm already a bit further now:

I'm confident we will be able to fix this urandom issue, but I have the feeling this is the first hurdle in a lot of hurdles. For example: once urandom works, and it uses the in-memory filesystem, how would pdfium load the fonts that it needs?

gedw99 commented 2 years ago

I'm with you about a nest of issues after issues.

It's a shame that pdfium is not written so that it just does compute on data fed to it and sends back a result. No FS, no networking. The options you're finding are basically doing that, for example making the FS run in memory.

Regarding fonts, I wonder if there is an option to pass in the font files (rather than an FS path) so that it then holds them in memory (since the memory is representing the FS now)? I am really clueless about pdfium though and just brainstorming with you.

gedw99 commented 2 years ago

I also want to say that before I started using go-pdfium I searched for a pure golang lib, so that I knew I would not get trapped in this situation. I did not search very long and quickly settled on this lib.

I know there is a replacement for harfbuzz that is now in golang (https://github.com/benoitkugler/textlayout). Does pdfium use harfbuzz? Seems it does, like all Google projects: https://groups.google.com/g/pdfium/c/pwtg4PBZekU

jerbob92 commented 2 years ago

Well it might be possible to make Emscripten add the fonts directory when compiling pdfium so that it automatically loads it when the memory FS is started. But that still makes it impossible to open/write PDF files from/to a path, so I would prefer if there would be a full WASI backend for the Emscripten WasmFS so that we can just handle the FS calls from Go and make use of the os.FS implementation that Wazero already has. For now I'm focussing on just getting it to work though, so that I can do a performance benchmark between single-threaded go-pdfium, multi-threaded go-pdfium (gRPC plugins) and wasm go-pdfium.

Regarding the fonts, I don't think using them would be a problem in itself; right now the problem is just that there are no fonts to load in the default memory FS of Emscripten.

jerbob92 commented 2 years ago

One step closer :dancers:

It appears you have to call _initialize when there is no main method (like in PDFium) before doing anything else to make Emscripten set some things up.

Getting a different error now though but we're getting somewhere.

gedw99 commented 2 years ago

Thanks for deep diving this. It's a learning experience for me... Is there a branch on GitHub to watch the action yet?

jerbob92 commented 2 years ago

Well actually the new error is a big problem. Yes, I got WasmFS to behave correctly now, it's able to write to stdout and stderr, and it's able to open /dev/urandom. However, Emscripten Standalone doesn't even provide an entropy source, so it will still fail when trying to get random data: https://github.com/emscripten-core/emscripten/blob/f5a1916484da9b2dfb4242237f8fb7b29d42c501/system/lib/standalone/standalone.c#L117

The only way to fix this is to provide getentropy in the pdfium build, or create our own WASI WasmFS backend. I think I will create the backend, because it will also fix our issues with PDF and font loading from files, and other people will probably need it too.

I have my current code placed here: https://github.com/jerbob92/go-pdfium-wasm It contains the PDFium WASM that I currently use, with debug symbols. It also contains a patch against the pdfium-binaries master branch to make that build.

codefromthecrypt commented 2 years ago

> The only way to fix this is to provide getentropy in the pdfium build, or create our own WASI WasmFS backend. I think I will create the backend, because it will also fix our issues with PDF and font loading from files, and other people will probably need it too.

I wonder how bad or possible it would be to patch emscripten to use the wasi import wasi_snapshot_preview1 random_get as that's usually what's used for entropy. Ex wazero uses the same source for this and also when using the ultra-slow go compiled GOOS=js (crypto.getRandomValues)
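For reference, the host side of `random_get` is small: it takes a pointer and a length in guest memory and returns a WASI errno. The sketch below models guest memory as a plain byte slice (an assumption for illustration; a runtime such as wazero already ships its own implementation) and uses the preview1 errno values for success, fault and io as I understand them.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// WASI preview1 errno values used by this sketch.
const (
	errnoSuccess uint32 = 0
	errnoFault   uint32 = 21 // bad address
	errnoIO      uint32 = 29 // I/O error
)

// randomGet fills mem[buf : buf+bufLen] with random bytes, mirroring the
// shape of random_get(buf, buf_len) -> errno. mem stands in for the
// guest's linear memory.
func randomGet(mem []byte, buf, bufLen uint32) uint32 {
	end := uint64(buf) + uint64(bufLen)
	if end > uint64(len(mem)) {
		return errnoFault
	}
	if _, err := rand.Read(mem[buf:end]); err != nil {
		return errnoIO
	}
	return errnoSuccess
}

func main() {
	mem := make([]byte, 32)
	fmt.Println("errno:", randomGet(mem, 8, 16))
	fmt.Printf("bytes 8..24: %x\n", mem[8:24])
	fmt.Println("out of range errno:", randomGet(mem, 24, 16))
}
```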

jerbob92 commented 1 year ago

Yeah good idea, will look into that.

jerbob92 commented 1 year ago

@codefromthecrypt that works. But now we're up to the next error :crying_cat_face:

[FATAL:partition_bucket.cc(618)] Check failed: adjusted_next_partition_page + slot_span_reservation_size <= root->next_partition_page_end.

jerbob92 commented 1 year ago

I actually just found this: https://github.com/emscripten-core/emscripten/issues/14459 It's a pretty interesting read, and it basically tells us that a PDFium WebAssembly build will not work without patching PDFium itself, because it uses features that emscripten does not support.

gedw99 commented 1 year ago

Hey @jerbob92 maybe a different approach could also be taken.

We could run pdfium (wasm) with Wazero and wrap golang calls into it via host functions or similar.

https://github.com/tetratelabs/wazero/issues/601 seems to indicate that Wazero does support emscripten based wasm.

You could probably also use this technique for the browser by running pdfium (wasm) inside a Web Worker, then using tinygo to compile the golang wrapper functions and running them outside the Web Worker in the normal browser window.

So the architectural topology for Browser and Server (with Wazero) is similar.

Just an idea. I stumbled across this solution when playing around with another lib. It was not for the Emscripten aspect, but to enable quasi threading by using Workers (single thread) with a Controller that manages sending work to each Worker.
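The controller/worker topology maps naturally onto goroutines and channels on the server side. In this sketch each worker owns one `instance`, a hypothetical stand-in for a single-threaded wasm instance, and the controller only hands out jobs and collects results; nothing here runs PDFium.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is one unit of work, e.g. "render page N".
type Job struct{ Page int }

// Result records which worker handled which page.
type Result struct{ Page, Worker int }

// instance stands in for one single-threaded wasm instance (hypothetical).
type instance struct{ id int }

func (in *instance) render(j Job) Result { return Result{Page: j.Page, Worker: in.id} }

// runPool is the controller: it starts the workers, feeds them jobs and
// collects every result. Each instance is only touched by its own worker.
func runPool(workers int, jobs []Job) []Result {
	in := make(chan Job)
	out := make(chan Result)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(inst *instance) {
			defer wg.Done()
			for j := range in {
				out <- inst.render(j)
			}
		}(&instance{id: w})
	}
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()
	go func() {
		wg.Wait()
		close(out)
	}()
	var results []Result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	jobs := []Job{{1}, {2}, {3}, {4}, {5}}
	for _, r := range runPool(2, jobs) {
		fmt.Printf("page %d rendered by worker %d\n", r.Page, r.Worker)
	}
}
```

Because no instance is shared between goroutines, the guest code never needs to be thread-safe; parallelism comes from running several instances side by side.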

A Bud that works for the Browser target and Wazero target is another thing I am planning to work on so that this Architecture is easier to use.

Curious what you and @codefromthecrypt think of this approach.

It sidesteps the problem of many languages needing to compile together by isolating them using process barriers (not a great word to describe it, I know).