rogchap / v8go

Execute JavaScript from Go
https://rogchap.com/v8go
BSD 3-Clause "New" or "Revised" License

Explanation of architecture #129

Open · mitar opened this issue 3 years ago

mitar commented 3 years ago

I would like to better understand how this package integrates V8 with Go. I could not find it in the documentation, so I am making this question/documentation-request issue.

How does V8 integrate with Go? Are the threads used by V8 completely separate from the threads used by Go? How do you control how many threads are used? By the number of isolates you make? Is an isolate always bound to one thread, or does it move around?

I understand why V8 has isolates and contexts (an isolate is one tab, while contexts are the multiple frames shown on that tab), but in the context of Go, I do not see a reason why one would use multiple contexts and not simply multiple isolates? Using multiple contexts seems like it can only introduce problems, with one context maybe blocking another? Is it any easier to share data between contexts than between isolates? It looks to me like in all cases data goes first to Go and then back in?

Are isolates/contexts thread safe? How do they work with goroutines? If I create one isolate, can I spawn contexts from different goroutines?

In my view, ideally, an isolate should be tied to the current goroutine, so that it can also be moved between threads as the Go runtime sees fit. Or even better, contexts could each live in their own (parallel) goroutine, while sharing one isolate. My use case is to run some JS for every HTTP request (similar to server-side rendering), and spawning an isolate for each request seems costly. On the other hand, if isolates run on threads separate from regular Go threads, then it is tricky to know what pool of isolates I have to keep around, how large it should be, and so on. Then it becomes almost like having some background node processes I send things to for processing.

rogchap commented 3 years ago

Your questions are more about V8 and CGO than specifically the architecture of this project.

V8 is single threaded, so whichever thread the goroutine is running on will be the thread that the Isolate is tied to. In Go you have no control over the thread a goroutine is tied to; concurrency is not parallelism, so you can't run multiple contexts in goroutines using a single isolate.

Generally you do not communicate from one context to another; you would have to explicitly share the same context security token between them for that to happen.

You can only parallelize Isolates, not Contexts, so for your use case you would have to spawn one Isolate per request/goroutine. You could always have an Isolate pool to speed up the startup time.

mitar commented 3 years ago

Thanks for the answers.

So creating an isolate uses the current goroutine (the one from which you called v8go.NewIsolate()), or does it create a new goroutine for it?

Are isolates/contexts thread safe? If I access a reference to an isolate/context from multiple goroutines, can that work? Or do I need to use my own locking around it?

rogchap commented 3 years ago

Isolate will be created on the thread of the current goroutine; v8go does NOT create a new goroutine for the Isolate.

Isolates and Contexts are NOT thread safe. You will find that you can access an Isolate/Context from multiple goroutines only if you run your Go process on one thread, e.g. via runtime.LockOSThread(). If you don't lock the thread you will get random panics, as you can't control whether the goroutine creates a new thread to run on.

mitar commented 3 years ago

Hm, does this mean that for a long-running isolate, it could happen (without using runtime.LockOSThread) that the goroutine holding the isolate gets migrated to another thread, which then panics? Should runtime.LockOSThread then be called by this package automatically?

mitar commented 3 years ago

Isolate will be created on the thread of the current goroutine; v8go does NOT create a new goroutine for the Isolate.

But that means that when control returns to the current goroutine, the v8 event loop is not running anymore? Or, how long does RunScript run, and when does it return? When all top-level statements in the script have been executed? Or when there are no promises left to be resolved, for example?

arctica commented 3 years ago

Isolate will be created on the thread of the current goroutine; v8go does NOT create a new goroutine for the Isolate.

Isolates and Contexts are NOT thread safe. You will find that you can access an Isolate/Context from multiple goroutines only if you run your Go process on one thread, e.g. via runtime.LockOSThread(). If you don't lock the thread you will get random panics, as you can't control whether the goroutine creates a new thread to run on.

runtime.LockOSThread() does not lock the whole process to one thread. It only pins the calling goroutine to its current thread. It does not allow you to access an Isolate/Context from multiple goroutines. In fact, I don't see any way runtime.LockOSThread() could help with v8go and running stuff in multiple goroutines.

But that means that when control returns to the current goroutine, the v8 event loop is not running anymore? Or, how long does RunScript run, and when does it return? When all top-level statements in the script have been executed? Or when there are no promises left to be resolved, for example?

When V8 returns (to v8go), it will not start executing again by itself until called. It does not wait for promises to resolve, etc. When the called code finishes executing, V8 returns control to the caller. Don't think of V8 in terms of an event loop. Think of contexts as data structures inside a collection of them (the isolate), on which you can run functions such as executing a script.

So the gist is: isolates and contexts cannot be accessed by multiple goroutines. For the webserver scenario you probably want a resource pool (like sync.Pool) of Isolates, so that only one goroutine accesses a given Isolate at any moment in time, and then create a new Context for each request (that's cheap). Make sure to close the Context at the end of the request, because unfortunately v8go does not free anything allocated in that Context until it is closed.
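
For illustration, a minimal sketch of that pool approach (hedged: assuming the current v8go API, where NewIsolate returns the isolate directly and NewContext returns an error; signatures have varied across versions, and sync.Pool may drop isolates without disposing them, so a sized channel of isolates is a reasonable alternative):

    package main

    import (
        "sync"

        v8 "rogchap.com/v8go"
    )

    // isoPool amortizes the expensive isolate creation across requests; only
    // one goroutine uses a given isolate between Get and Put.
    var isoPool = sync.Pool{
        New: func() interface{} { return v8.NewIsolate() },
    }

    func handleRequest(src string) (string, error) {
        iso := isoPool.Get().(*v8.Isolate)
        defer isoPool.Put(iso)

        ctx, err := v8.NewContext(iso) // cheap: one fresh context per request
        if err != nil {
            return "", err
        }
        defer ctx.Close() // nothing allocated in the context is freed until Close

        val, err := ctx.RunScript(src, "request.js")
        if err != nil {
            return "", err
        }
        return val.String(), nil
    }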

mitar commented 3 years ago

Thanks. So how does one then run a long-running event loop inside v8go?

dylanahsmith commented 3 years ago

Isolates and Context are NOT thread safe

Does this mean we can get rid of the use of v8::Locker in v8go.cc? Since this would just add unnecessary overhead if the isolate isn't used in parallel across goroutines.

rogchap commented 3 years ago

There are still a few APIs in V8 that are thread safe, for example iso.TerminateExecution(), and I think the lock is still needed to support these APIs.
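
For context, this is the one place v8go relies on cross-thread calls today. A hedged sketch of the usual pattern (close to the long-running-script example in the v8go README; timeout value is arbitrary): run the script on one goroutine and terminate it from another.

    iso := v8.NewIsolate()
    defer iso.Dispose()
    ctx, _ := v8.NewContext(iso)
    defer ctx.Close()

    vals := make(chan *v8.Value, 1)
    errs := make(chan error, 1)

    go func() {
        val, err := ctx.RunScript("while (true) {}", "spin.js") // never finishes on its own
        if err != nil {
            errs <- err
            return
        }
        vals <- val
    }()

    select {
    case <-vals:
    case <-errs:
    case <-time.After(100 * time.Millisecond):
        iso.TerminateExecution() // safe to call from another goroutine/thread
        <-errs                   // RunScript returns with a termination error
    }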

dylanahsmith commented 3 years ago

v8::Isolate::TerminateExecution()'s documentation says:

This method can be used by any thread even if that thread has not acquired the V8 lock with a Locker object.

v8::Isolate::CancelTerminateExecution()'s documentation says the same thing.

arctica commented 3 years ago

v8::Isolate::TerminateExecution()'s documentation says:

This method can be used by any thread even if that thread has not acquired the V8 lock with a Locker object.

v8::Isolate::CancelTerminateExecution()'s documentation says the same thing.

I think older versions checked for an active lock in TerminateExecution. Anyway, there is a check in the HandleScope constructor [1], so locking is needed in most cases.

  // We do not want to check the correct usage of the Locker class all over the
  // place, so we do it only here: Without a HandleScope, an embedder can do
  // almost nothing, so it is enough to check in this central place.

[1] https://source.chromium.org/chromium/chromium/src/+/master:v8/src/api/api.cc;l=852

dylanahsmith commented 3 years ago

The HandleScope check will pass if !v8::Locker::IsActive() is true, and you can see from its implementation that this is the case only if Locker was never used in the process. So using it or not is an all-or-nothing decision.

I tried removing all the Locker locker(iso); lines and it did uncover tests that tried to use the same context in parallel (e.g. tests like TestValueFormatting, which creates the context, then does t.Run in a loop with t.Parallel inside that).

arctica commented 3 years ago

The HandleScope check will pass if !v8::Locker::IsActive() is true, and you can see from its implementation that this is the case only if Locker was never used in the process. So using it or not is an all-or-nothing decision.

Nice catch! I'm not sure that's a trade-off worth making though. It would rule out finalizers, unless the finalizer sends a message to the isolate's main goroutine (which would need locking), which then somehow manages to free() these resources. Out of curiosity: did you profile how much the Locker is costing?

dylanahsmith commented 3 years ago

I'm not sure if that's a trade-off worth making though.

I mostly was just starting the discussion, since this is a more implicit part of the API. There are definitely trade-offs between performance and convenience with API decisions like this, so I wanted to make them more explicit.

I just noticed the overhead from reading the code. Benchmarking confirms my suspicion for code that makes a lot of otherwise cheap v8go calls, such as iterating a JS array using Object.GetIdx, which seemed to take about 3 times as long with the locking overhead. I'll open a PR with a benchmark. Even after that, there is another order of magnitude difference when compared to a C++ implementation.

It would rule out finalizers

It wouldn't rule it out, but it might influence what synchronization is relied upon to implement finalizers. I don't think this is the granularity you would want for finalizer synchronization, since it could cause a lot of lock contention between JS running in various isolates and the GC finalizer goroutine. Instead, I think you would want to effectively queue the freeing of values; they can then actually be freed while the isolate is being used, such as before or after running a script or performing microtasks.
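
A hedged sketch of that queued-free idea (valuePtr and the freeValue callback are hypothetical stand-ins, not v8go internals): the Go GC finalizer only enqueues, and the isolate's owning goroutine drains the queue at a safe point.

    package main

    import "sync"

    // valuePtr stands in for v8go's internal C pointer type.
    type valuePtr uintptr

    type freeQueue struct {
        mu      sync.Mutex
        pending []valuePtr
    }

    // enqueue is what a Go GC finalizer would call: cheap, never touches V8,
    // so it causes no lock contention with running JS.
    func (q *freeQueue) enqueue(p valuePtr) {
        q.mu.Lock()
        q.pending = append(q.pending, p)
        q.mu.Unlock()
    }

    // drain runs on the isolate's own goroutine at a safe point, e.g. just
    // before or after running a script, and performs the actual releases.
    func (q *freeQueue) drain(freeValue func(valuePtr)) {
        q.mu.Lock()
        batch := q.pending
        q.pending = nil
        q.mu.Unlock()
        for _, p := range batch {
            freeValue(p) // the cgo call that would release the v8 handle
        }
    }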

arctica commented 3 years ago

I mostly was just starting the discussion, since this is a more implicit part of the API. There are definitely trade-offs between performance and convenience with API decisions like this, so I wanted to make them more explicit.

I just noticed the overhead from reading the code. Benchmarking confirms my suspicion for code that makes a lot of otherwise cheap v8go calls, such as iterating a JS array using Object.GetIdx, which seemed to take about 3 times as long with the locking overhead. I'll open a PR with a benchmark. Even after that, there is another order of magnitude difference when compared to a C++ implementation.

It's definitely an interesting discussion. I also did a quick benchmark of my version, which in addition to using the V8 Locker also does some locking inside Go around Value allocations, and saw results that did surprise me. The lock inside Go had not much impact at all, while the Locker in V8 made a loop in which I allocate string Values about 30% slower. It's just one thread, so there's no lock contention. Reading the source of the Locker, I noticed that there is a fast and a slow path for the lock. The fast path is taken when the current thread has already locked the Isolate. There are two problems for us: locks are usually held only briefly, and goroutines migrate from thread to thread. I did an experiment in which I grabbed the lock when creating the isolate, without releasing it, and pinned the goroutine to its current thread. That made the overhead of the V8 Locker pretty much disappear. I don't see an obvious way to leverage that fact though. Pinning the goroutine is easy, but holding on to a Locker doesn't make much sense, apart from maybe exposing it to the user so they can speed up a tight loop:

func (iso *Isolate) Locked(fn func()) {
    runtime.LockOSThread() // keep this goroutine pinned while the lock is held
    iso.Lock()             // take the V8 Locker once for the whole batch of calls
    fn()
    iso.Unlock()
    runtime.UnlockOSThread()
}

// user code
iso.Locked(func() {
    for ... {
        // lots of calls to V8
    }
})

It wouldn't rule it out, but it might influence what synchronization is relied upon to implement finalizers. I don't think this is the granularity you would want for finalizer synchronization, since it could cause a lot of lock contention between JS running in various isolates and the GC finalizer goroutine. Instead, I think you would want to effectively queue the freeing of values; they can then actually be freed while the isolate is being used, such as before or after running a script or performing microtasks.

Yes, that's why I put a caveat after my statement, and I think we meant the same thing, I just didn't express it as clearly, I guess. In the end that would result in the most efficient implementation without using the V8 Locker. The only tricky situation arises when there are no further calls that would trigger this semi-automatic garbage collection. A method like iso.GC() could help.

arctica commented 3 years ago

I've removed all uses of Locker and any other use of a Mutex in Go to experiment, and have run into strange issues with less commonly used features in V8. For example, using the --expose-gc flag in combination with a benchmark that allocates and frees Contexts in a loop resulted in V8 printing "Error installing extension 'v8/gc'" and not being able to allocate new Contexts after a while. This happened even though I was only running the benchmark in a single goroutine. It does not happen when I put the Locker in place.

Has anyone else seen this or similar issues? It makes me wonder whether enough projects in the wild are using V8 without the Locker. I know that internally V8 uses threads for various things, so maybe there are bugs that were just never exposed because they assume a Locker is always used.

dylanahsmith commented 3 years ago

Note that cgo effectively locks the OS thread to run the C code on the thread stack, so we don't need to worry about goroutine migration during those C calls.

You are right about v8::Locker having a fast path for when the lock is already taken. The reason you need runtime.LockOSThread() around locking the isolate in your Isolate.Locked function is that v8::Locker uses a thread id to keep track of which thread owns the lock, so a goroutine migration while running the Go code would indeed be unsafe while holding the v8::Locker lock.

The other way v8 relies on thread-local state is to enforce a stack limit. It turns out that v8::Locker automatically updates the stack limit when taking the lock. Just removing the use of v8::Locker results in v8 stack overflow errors after a goroutine migration to another thread with higher stack addresses, so we would instead need to replace the use of v8::Locker with v8::Isolate::SetStackLimit based on the current stack position (GetCurrentStackPosition()'s implementation shows how v8 gets the current stack position). Note that this v8 stack overflow error is quite subtle at the moment, since for some reason it has an empty error message, which v8go doesn't currently handle well.
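
To make the idea concrete, here is a hedged cgo sketch of what that replacement could look like. SetIsolateStackLimit is a hypothetical shim (stubbed below so the snippet compiles) standing in for a v8go.cc function that would call v8::Isolate::SetStackLimit; the limit has to be computed from the current C stack, since cgo calls run on the OS thread's system stack rather than the goroutine stack.

    package stacklimit

    /*
    #include <stdint.h>

    // Hypothetical stand-ins for v8go's C++ shim; in v8go.cc this would call
    // iso->SetStackLimit(limit). Stubbed here so the sketch compiles.
    typedef void* IsolatePtr;
    static void SetIsolateStackLimit(IsolatePtr iso, uintptr_t limit) {
        (void)iso; (void)limit;
    }

    // Recompute the stack limit from an address on the current C stack,
    // mirroring what v8::Locker does when taking the lock. This must run on
    // whichever OS thread will execute JS next.
    static void update_stack_limit(IsolatePtr iso) {
        char here;
        SetIsolateStackLimit(iso, (uintptr_t)&here - 500 * 1024); // ~500 KiB headroom
    }
    */
    import "C"

    // The wrapper would call C.update_stack_limit(iso) before entering V8 on a
    // (possibly different) OS thread, instead of relying on v8::Locker to do it.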

arctica commented 3 years ago

@dylanahsmith yea, I was aware of the threading stuff; that's why I mentioned the fast path and using LockOSThread(). I've done some more experimentation regarding locking and ended up with 3 implementations:

  1. No locking; this only works when pinning the goroutine to the OS thread. I don't think the issue is just stack limits; in fact, the docs say one must hold a Locker when calling SetStackLimit(). I've seen TLS usage in various places in V8; it just can't deal with calls from multiple threads, even when they are not done in parallel.
  2. Using a stack-allocated Locker as usual in C, plus some pseudo-recursive locking on the Go side of things.
  3. Allocating a v8::Locker on the heap early from Go and releasing it from Go when done. The goroutine needs to be pinned to the OS thread while the lock is held.

Solution 1 is obviously the fastest but of course prevents one from using multiple goroutines for a given Isolate. The advantage is very pronounced (e.g. up to 3x) when juggling Values in a tight loop. But when allocating Contexts the advantage is not very noticeable, as their overhead dominates. Not a good fit if one wants to use finalizers.

Solutions 2 and 3 have a noticeable speed difference, 3 being up to 30% slower. Why would one use solution 3 anyway? Well, it theoretically allows for full thread safety. Unfortunately I didn't see an easy way to use solution 2 for this, because the locking inside Go can't be done in a recursive manner. Go doesn't provide recursive mutexes and doesn't expose the id of a goroutine without ugly and slow hacks. Recursive mutexes would be needed because a function call which holds a lock can call back into Go code, which would try again to grab the lock. V8's Locker is recursive, so solution 3 covers all needs. Solution 2 only allows one main goroutine for a given Isolate, plus other goroutines that strictly call only functions guaranteed not to call anything that will try to grab the mutex. I used this for finalizers that free V8 resources.

dylanahsmith commented 3 years ago

in fact the docs say one must hold a Locker when calling SetStackLimit()

It says "If you are using threads then you should hold the V8::Locker lock" since the code isn't thread-safe to call concurrently from multiple threads.

I've seen TLS usage in various places in V8, it just can't deal with calls from multiple threads even if not done in parallel.

The v8::Locker was explicitly added for the purpose of supporting calls from multiple threads when not done in parallel, so v8 is already constrained in how it does this. For instance, it is free to use thread-local state while it would expect v8::Locker to be held, such as during script execution, but it can't assume the state will remain after it is unlocked, before the next call that requires v8::Locker to be used.

It also can't rely on thread_local state being for the same isolate after the v8::Locker is unlocked, since multiple isolates could be used in the same thread. This is why SetStackLimit() also says "you must set a non-default stack limit separately for each thread", because this stack limit is stored in Isolate::PerIsolateThreadData. Storing a reference to an address on the stack is unusual, so it is more of an edge case that we need to work around. Re-using other per-thread state shouldn't be as much of a problem. With v8::Locker we could even end up with the opposite problem, if we want to expose a way to set per-thread state and not have it get reset when v8::internal::ThreadManager::RestoreThread automatically restores this per-thread state on Locker::Initialize.

Of course, let me know if you did experience issues using SetStackLimit and not using v8::Locker or know of something specific that will be a problem based on what you have seen from the v8 code.

But when allocating Contexts the advantage is not very noticeable, as their overhead dominates.

Allocating a new isolate or context would be amortized over how long they are used for, so it isn't really comparable at the library level, since either could be a bottleneck.

Solution 1 is obviously the fastest but of course prevents one from using multiple goroutines for a given Isolate.

It wouldn't prevent that, they would just need to use synchronization around it. For instance, an isolate could be sent to another goroutine over a channel, which is idiomatic for Go.

Unfortunately I didn't see an easy way to use solution 2 for this, because the locking inside Go can't be done in a recursive manner.

Why is it necessary to support recursive locking? I would expect callbacks to the function template to just not do any explicit locking.

arctica commented 3 years ago

It says "If you are using threads then you should hold the V8::Locker lock" since the code isn't thread-safe to call concurrently from multiple threads.

Unfortunately the docs don't state the reason for holding the Locker. The Locker has two effects: 1. exclusion from parallel execution; 2. updating internal data when a thread switch has been detected (e.g. all the stuff in ThreadManager::RestoreThread()). If the reason for holding the lock were only to ensure 1, then it would be fine to skip Locker usage, but I'd hesitate to rely on that if it is not explicitly mentioned in the docs, even if the actual implementation allowed for it, as that might change over time.

The v8::Locker was explicitly added for the purpose of supporting calls from multiple threads when not done in parallel, so v8 is already constrained in how it does this. For instance, it is free to use thread-local state while it would expect v8::Locker to be held, such as during script execution, but it can't assume the state will remain after it is unlocked, before the next call that requires v8::Locker to be used.

Yes, at first this was not obvious to me and I thought the Locker was just acting as a mutex.

It also can't rely on thread_local state being for the same isolate after the v8::Locker is unlocked, since multiple isolates could be used in the same thread. This is why SetStackLimit() also says "you must set a non-default stack limit separately for each thread", because this stack limit is stored in Isolate::PerIsolateThreadData. Storing a reference to an address on the stack is unusual, so it is more of an edge case that we need to work around. Re-using other per-thread state shouldn't be as much of a problem. With v8::Locker we could even end up with the opposite problem, if we want to expose a way to set per-thread state and not have it get reset when v8::internal::ThreadManager::RestoreThread automatically restores this per-thread state on Locker::Initialize.

Of course, let me know if you did experience issues using SetStackLimit and not using v8::Locker or know of something specific that will be a problem based on what you have seen from the v8 code.

ThreadManager::RestoreThread() is called when a Locker lock is acquired and initializes thread data if it detects a new thread. Using just SetStackLimit() would skip this initialization.

I won't attempt to use SetStackLimit(), because I don't believe V8 guarantees that it's fine with execution jumping between threads even when using SetStackLimit() as in the above example. The Locker docs state "the v8::Locker and v8::Unlocker classes must be used to signal thread switches to V8." and that is a hard requirement. So either an Isolate stays on a given OS thread, or Locker is used. I don't think there is room for anything in between.

Allocating a new isolate or context would be amortized over how long they are used for, so it isn't really comparable at the library level, since either could be a bottleneck.

Yep, most programs won't allocate contexts in a loop. I just mentioned it in passing.

Solution 1 is obviously the fastest but of course prevents one from using multiple goroutines for a given Isolate.

It wouldn't prevent that, they would just need to use synchronization around it. For instance, an isolate could be sent to another goroutine over a channel, which is idiomatic for Go.

To my understanding this is not correct. As soon as multiple goroutines are at play, you are virtually guaranteed to have multiple OS threads. Keep in mind, solution 1 completely avoids using v8::Locker, and as discussed above this means an isolate can never migrate between OS threads (unless the user explicitly signals the thread switch to V8 via Locker). There is also no way to make two goroutines lock to the same OS thread, even if not at the same time. Using a channel to send the isolate to another goroutine is not sufficient.
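
For the record, the pattern that does work under solution 1 is the one mitar alluded to at the start of the thread: a single goroutine, pinned to its OS thread, owns the isolate for its whole lifetime, and other goroutines submit work to it over a channel. A hedged sketch (v8go signatures vary by version):

    package main

    import (
        "runtime"

        v8 "rogchap.com/v8go"
    )

    // startIsolateWorker owns one isolate on one pinned OS thread; other
    // goroutines never touch the isolate directly, only the jobs channel.
    func startIsolateWorker(jobs <-chan func(*v8.Isolate)) {
        go func() {
            runtime.LockOSThread() // the isolate must never change OS threads
            defer runtime.UnlockOSThread()

            iso := v8.NewIsolate()
            defer iso.Dispose()

            for job := range jobs {
                job(iso) // every V8 call happens on this one goroutine/thread
            }
        }()
    }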

Why is it necessary to support recursive locking? I would expect callbacks to the function template to just not do any explicit locking.

The locking in my implementation is done implicitly inside the library and not left to the library user. Every call into V8 needs to be synchronized (ignoring solution 1), and there can be, for example, a call executing a script which then calls back into Go, which again calls into V8, the most obvious example being the Go callback allocating some Value to pass back as a return value. The library doesn't know whether it has already entered the isolate in the current goroutine. I am sure the v8::Locker was made recursive for this reason, but unfortunately the Godfathers of Go decided one should not have access to low-level stuff like a goroutine ID, and that's ok as long as one stays purely in Go land, but it leads to issues when CGO comes into play and the C library relies on these low-level features.
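
A concrete illustration of that re-entrancy, sketched against v8go's function-template API (signatures vary by version): RunScript holds the implicit lock when V8 calls back into Go, and the callback immediately re-enters V8 to build its return value, so a non-recursive library-level lock would deadlock here.

    package main

    import v8 "rogchap.com/v8go"

    func main() {
        iso := v8.NewIsolate()
        defer iso.Dispose()

        global := v8.NewObjectTemplate(iso)
        fn := v8.NewFunctionTemplate(iso, func(info *v8.FunctionCallbackInfo) *v8.Value {
            // We are inside RunScript here: the implicit lock is already held
            // by this goroutine, and NewValue re-enters V8 to build the result.
            val, _ := v8.NewValue(iso, "from Go")
            return val
        })
        global.Set("fromGo", fn)

        ctx, _ := v8.NewContext(iso, global)
        defer ctx.Close()
        ctx.RunScript("fromGo()", "reentrant.js") // lock -> JS -> Go callback -> V8 again
    }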

Now, if I left it up to the user of the library to ensure proper synchronization, then yes, they would just not do any explicit locking when calling back into V8. This could easily be done by using the mode from solution 1 plus exposing the Lock()/Unlock() methods from solution 3, which use the v8::Locker. Then the user can also hand off an Isolate to a different goroutine.

dylanahsmith commented 3 years ago

The Locker docs state "the v8::Locker and v8::Unlocker classes must be used to signal thread switches to V8."

Actually, that was hidden in the v8::Unlocker documentation. Nice find. Implementation wise, it looks easy enough for v8 to support signalling thread switches without using v8::Locker, but now it is clear that this should be proposed as a new v8 feature.

The locking in my implementation is done implicitly inside the library and not left to the library user.

That's the case with the current implementation, so I thought you were talking about an explicit v8go.Isolate.Locked method for reducing locking overhead. In that case, v8::Locker could still be used implicitly to signal thread switching without explicit locking. Calling the method Locked would clearly indicate it can be used to synchronize use across threads, but could also help in implying that v8go doesn't guarantee thread-safety when that Locked method isn't used.

If for solution 3 you mean that this synchronization would be done implicitly, then it sounds like that would add additional calls from Go to C, so would add additional overhead compared to the current implementation for that full implicit thread-safety.

arctica commented 3 years ago

Actually, that was hidden in the v8::Unlocker documentation. Nice find. Implementation wise, it looks easy enough for v8 to support signalling thread switches without using v8::Locker, but now it is clear that this should be proposed as a new v8 feature.

Haha yea it was indeed in the Unlocker documentation and not Locker. I stumbled upon it by accident. I agree that it would be nice if V8 exposed a way to signal thread changes without requiring Locker. But right now we are either forced to use the Locker or make sure that threads are never changed.

That's the case with the current implementation, so I thought you were talking about an explicit v8go.Isolate.Locked method for reducing locking overhead. In that case, v8::Locker could still be used implicitly to signal thread switching without explicit locking. Calling the method Locked would clearly indicate it can be used to synchronize use across threads, but could also help in implying that v8go doesn't guarantee thread-safety when that Locked method isn't used.

Right. I would like to support both: implicit locking inside the library, and explicit locking by the user via either an iso.Lock() method or iso.Locked(func()). When implicitly locking, each call would use a stack-allocated v8::Locker as usual, and this would be the default. A v8go.DisableLocking() or ManualLocking() method would disable this behaviour and hopefully make it clear enough that manual locking via iso.Lock()/Locked() would then need to be used. When locking is disabled, each call that allocates an Isolate would lock the goroutine to the OS thread to prevent it from randomly migrating onto other threads via the Go scheduler. Handing an isolate to another goroutine would maybe be best done via a library-provided mechanism that unlocks the old goroutine, locks the receiving one, and signals the thread change to V8.

If for solution 3 you mean that this synchronization would be done implicitly, then it sounds like that would add additional calls from Go to C, so would add additional overhead compared to the current implementation for that full implicit thread-safety.

Solution 3 is actually usable for both implicit and explicit locking. There is indeed additional overhead in the implicit mode, as there are many more cgo calls, as you mentioned. For explicit locking I believe the overhead should be negligible. The advantage is that this way full thread safety in implicit locking mode can be implemented, which I don't think is possible any other way due to the problems with recursive locks. By full thread safety I mean including the Go parts, which currently are not thread safe. I am not sure this is desirable though, as the overhead would indeed be considerable, since even functions that don't call into V8 would need the locking. For now, I have not implemented this full thread safety.