rojo-rbx / rbx-dom

Roblox DOM and (de)serialization implementation in Rust
MIT License
105 stars 42 forks source link

Memory leak in SharedString #350

Closed cassanof closed 10 months ago

cassanof commented 11 months ago

I ran a memory profiler against my program, as it was leaking memory at the end of the program. I found the problem to be with how SharedString is allocated or deallocated. I think it may be a race condition in the Drop implementation: https://github.com/rojo-rbx/rbx-dom/blob/9d2418d834dd5e47dc2927a5860ba339942414ef/rbx_types/src/shared_string.rs#L101 . I'm not entirely sure; this problem only occurs when you parse concurrently.

cassanof commented 11 months ago

In my run: In total, 1,343,532 bytes were allocated for the SharedString. At the end of the program, there were 671,760 left.

cassanof commented 11 months ago

Looks like this is a known race condition with Arc::try_unwrap: https://doc.rust-lang.org/std/sync/struct.Arc.html#method.try_unwrap

cassanof commented 11 months ago

Can I make a PR?

Dekkonot commented 11 months ago

I would be willing to accept a PR for this, yeah. I'll be happy if it's easily solvable like that.

cassanof commented 11 months ago

That would fix it, however there is another unrelated issue i spoted (dangling pointer). I think the lock needs to be hoisted up (in that case, try_unwrap is fine, because there can't be more than 1 thread executing it). Here is the full fix I propose (pardon bad indentation):

impl Drop for SharedString {
    fn drop(&mut self) {
        // If the reference we're about to drop is the very last reference to
        // the buffer, we'll be able to unwrap it and remove it from the
        // SharedString cache.
            let mut cache = match STRING_CACHE.lock() {
                Ok(v) => v,
                Err(_) => {
                    // If the lock is poisoned, we should just leave it
                    // alone so that we don't accidentally double-panic.
                    return;
                }
            };
        if Arc::try_unwrap(self.data.take().unwrap()).is_ok() {

            cache.remove(&self.hash);
        }
    }
}

I know this makes it very inefficient, but it is necessary without making large changes. Here is why:

  1. Thread A is doing some stuff, allocates shared string "AAAA"
  2. Thread A is done done doing stuff, runs Drop on the string "AAAA". There are no more references, the current execution context is inside the if statement.
  3. At the same exact time, thread B allocates the shared string "AAAA", it gets the lock before thread A. The string is already allocated, so the Arc is given to the data. The lock is now freed.
  4. Thread A now gets the lock, and removes the hash from the cache.
  5. Thread B now has a dangling Arc

Double check that this is right, I don't work with this interning stuff very often :)

Dekkonot commented 11 months ago

That's quite inefficient, so I'm hesitant, but I think it's what we'll have to do, yeah. I'm unable to get this to happen during my own testing though; if you could produce a repro, it would help a lot with testing any fixes to it.

cassanof commented 11 months ago

if you could produce a repro, it would help a lot with testing any fixes to i

Yeah, I haven't seen this happen (the dangling ptr issue), just something I spotted. It's something that would very very rarely happen, but it could.

Do you want to just do the Arc::into_inner change?

Dekkonot commented 10 months ago

After giving this some more research, I don't believe this is an issue. By its nature, Arc uses atomic operations to perform reference counting and in particular guarantees that acquisitions are synchronized when checking the reference count. This is noted in the Drop implementation but you can see they use a similar mechanism in into_inner.

Definitely keep an eye out because I am not an expert but your concern seems unfounded.

cassanof commented 10 months ago

What are we talking about here? The dangling pointer issue or the Arc::try_unwrap issue? After the change I haven't encountered the memory leak, so it must have done something. The documentation explicitly states that two threads executing Arc::try_unwrap at the same time will cause a race condition: https://doc.rust-lang.org/std/sync/struct.Arc.html#method.try_unwrap

Dekkonot commented 10 months ago

Sorry, I should have clarified. I meant the dangling pointer problem. I'm fairly certain it cannot happen.