rust-lang / jobserver-rs

Apache License 2.0
69 stars 40 forks source link

Fix potentially dropped `Client` in `configure` #40

Closed NobodyXu closed 2 years ago

NobodyXu commented 2 years ago

Fixed #25

Signed-off-by: Jiahao XU Jiahao_XU@outlook.com

alexcrichton commented 2 years ago

On second read, I don't know whether this is safe. If this attempts to destroy the Arc<T> closure in the pre_exec closure I don't think that this is safe.

NobodyXu commented 2 years ago

On second read, I don't know whether this is safe. If this attempts to destroy the Arc<T> closure in the pre_exec closure I don't think that this is safe.

Oops, I didn't think of that.

I would probably revert it and fix that with lifetime subtype/variance.

NobodyXu commented 2 years ago

On second read, I don't know whether this is safe. If this attempts to destroy the Arc<T> closure in the pre_exec closure I don't think that this is safe.

I've fixed that by using lifetime subtyping to ensure that &self always outlives &Command.

alexcrichton commented 2 years ago

I think a better solution might be to use some sort of leaking method in the forked process. I'm wary to change the public API here and I'm not actually sure that this fixes the original issue (while it probably fixes the one specific example I'm not sure that this fixes all usages).

NobodyXu commented 2 years ago

I think a better solution might be to use some sort of leaking method in the forked process. I'm wary to change the public API here and I'm not actually sure that this fixes the original issue (while it probably fixes the one specific example I'm not sure that this fixes all usages).

That sounds like a good idea.

I will do it tomorrow.

NobodyXu commented 2 years ago

@alexcrichton I've fixed this by leaking the Arc.

alexcrichton commented 2 years ago

Reading over this, can you confirm the behavior one way or another? This now actually seems like it'll legitimately leak the memory in the host process.

NobodyXu commented 2 years ago

@alexcrichton According to CommandExt::pre_exec's documentation:

This closure will be run in the context of the child process after a fork. This primarily means that any modifications made to memory on behalf of this closure will not be visible to the parent process.

So this should be fine.

alexcrichton commented 2 years ago

I'm more worried about the Command in the original parent process where the Arc is leaked I believe.

NobodyXu commented 2 years ago

I'm more worried about the Command in the original parent process where the Arc is leaked I believe.

Thanks for pointing out!

Turns out that I missed that one :D

NobodyXu commented 2 years ago

@alexcrichton This should now be fixed as I use Arc::increment_strong_count in the children to make sure that the Arc is leaked, but not in the parent.

alexcrichton commented 2 years ago

Looks good to me, can you expand in the comment why the arc is leaked though?

NobodyXu commented 2 years ago

Looks good to me, can you expand in the comment why the arc is leaked though?

Arc::increment_strong_count increments the strong count of the Arc by one, thus it is leaked unless we pair it with Arc::decrement_strong_count.

alexcrichton commented 2 years ago

Sorry yes I understand that this is being leaked, what I would like the code to document is why it's being leaked because in the abstract that's a pretty weird thing to do intentionally.

NobodyXu commented 2 years ago

what I would like the code to document is why it's being leaked because in the abstract that's a pretty weird thing to do intentionally.

Hmmm, Arc::increment_strong_count(p.0) is equivalent to the following code:

let arc = mem::ManuallyDrop::new(unsafe { Arc::from_raw(p.0) });
let _arc_clone: mem::ManuallyDrop<_> = arc.clone();
NobodyXu commented 2 years ago

Sorry yes I understand that this is being leaked, what I would like the code to document is why it's being leaked because in the abstract that's a pretty weird thing to do intentionally.

Basically, Arc uses reference counting to keep track of the references and it is dropped when the reference counting reaches 0.

Using Arc::increment_strong_count increment the reference counting by 1 without creating an Arc, thus prevents the reference counting from ever reaching 0 and being dropped.

NobodyXu commented 2 years ago

@alexcrichton I've updated the comment in unix::Client::configure.

alexcrichton commented 2 years ago

Sorry I think more things need to be added here:

NobodyXu commented 2 years ago

@alexcrichton Got it, I would check the 4th point tomorrow by looking into the std library.

NobodyXu commented 2 years ago

@alexcrichton I've checked the source code of Command::do_exec, it does not destructs the pre_exec function we pass to them.

However, the use of pre_exec itself prevents the use of posix_spawn, which can use vfork to avoid duplicating the virtual memory mapping, so I think we should avoid the use of pre_exec by creating the pipe without O_CLOEXEC flag.

NobodyXu commented 2 years ago

@alexcrichton I've added a test case to assert it works.

alexcrichton commented 2 years ago

The test is also not currently running because it is not #[test]

NobodyXu commented 2 years ago

The test is also not currently running because it is not #[test]

Since I modeled the tests after client-of-myself.rs, I just need to add harness = false to the Cargo.toml

I don't think #[test] would work here as we manually implement main.

NobodyXu commented 2 years ago

@alexcrichton The test client-dropped-before-command failed on windows, however I have no idea how to fix it. Can we merge this PR and then fix this in another PR?

alexcrichton commented 2 years ago

No I would prefer to not land a broken test that doesn't succeed on all platforms.

NobodyXu commented 2 years ago

No I would prefer to not land a broken test that doesn't succeed on all platforms.

Fair enough.

After looking at the mod windows, I think the only way to fix it is to rework the API. I would redesign the API then.

NobodyXu commented 2 years ago

@alexcrichton I figured it out, there actually isn't any problem with the my implementation of the API.

No I would prefer to not land a broken test that doesn't succeed on all platforms.

We actually do not support sharing the jobserver across processes on windows, but I did not disable this test for windows.

Not sure why client-of-myself.rs does not fail though.

NobodyXu commented 2 years ago

Ignore the previous comment, it was actually wrong.

I mistook wasm.rs for windows.rs...

NobodyXu commented 2 years ago

@alexcrichton I have fixed the CI failure, can you review this again please?

The latest commits fixed this by changing configure to configure_and_run, taking Command by value and requires a function f that performs the spawning.

configure_and_run requires user to perform spawning in f to ensure that Client must be alive at when spawning and once the function exits, it will clear any environment variable it sets.

This also avoids the use of std::os::unix::process::CommandExt to register pre_exec, meaning that on unix, Command::spawn can use posix_spawn to efficiently spawn a new process using vfork.

I also added support for tokio::process::Command behind a feature flag tokio so now people using tokio can also use this crate trivally without manually converting between std::process::Command and tokio::process::Command.

alexcrichton commented 2 years ago

Sorry this has grown to the point that I don't really know what's going on internally any more. If it's this difficult to fix the original issue then it may not be worth fixing or otherwise could try to just result in a better error or something like that.

NobodyXu commented 2 years ago

Sorry this has grown to the point that I don't really know what's going on internally any more. If it's this difficult to fix the original issue then it may not be worth fixing or otherwise could try to just result in a better error or something like that.

The original issue is caused by the API itself.

The original API says that we can configure a Command so that it can be spawned at anytime and inherit the jobserver. This is where everything goes wrong, Client can be dropped before Command is dropped.

Trying to clone and store Client in CommandExt::pre_exec only works on unix and it has a performance cost (it cannot use posix_spawn) and does not work on windows since it does not have pre_exec handle.

Thus I decided to redesign the API to better illustrate that relationship and avoid the performance cost.

In the new API, you have to spawn inside Client::configure_and_spawn and on exit, any environment set by it will also be cleared.

This works for std::process::Command and tokio::process::Command since spawning is synchronous for both of them.

alexcrichton commented 2 years ago

Yes I understand it's the API itself that's causing the issue. This is so significantly different from before that I don't want to pull it into this crate. If you'd like though you could maintain a separate crate which depends on this one.

NobodyXu commented 2 years ago

Yes I understand it's the API itself that's causing the issue. This is so significantly different from before that I don't want to pull it into this crate. If you'd like though you could maintain a separate crate which depends on this one.

Understandable, I might just fork this crate and create jobserver2.