sjrusso8 / spark-connect-rs

Apache Spark Connect Client for Rust
https://docs.rs/spark-connect-rs
Apache License 2.0

Deadlock: concurrent cloned spark sessions #47

Closed wvwwvwwv closed 2 weeks ago

wvwwvwwv commented 2 weeks ago

Hi, thanks a lot for this nice crate!

I'd like to report a deadlock issue when a spark session is cloned and used concurrently.

https://github.com/sjrusso8/spark-connect-rs/pull/46 demonstrates a possible workflow leading to a deadlock.

The gist is that every place where #[allow(clippy::await_holding_lock)] is used opens up the possibility of a deadlock when a spark session is cloned and used concurrently.

-> When a task is suspended while holding a lock, another task will wait for the lock to be released without yielding to the executor (see the sketch below).
-> This is a very common, dangerous asynchronous programming pattern that should always be avoided at all costs.
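
For example, here is a minimal, self-contained sketch of the pattern (illustrative code, not the crate's actual internals), using a synchronous std::sync::RwLock on a current-thread tokio runtime:

```rust
use std::sync::{Arc, RwLock};

// Task A takes the synchronous write lock and is then suspended at an
// await point while still holding the guard.
async fn task_a(state: Arc<RwLock<u64>>) {
    let _guard = state.write().unwrap(); // sync lock acquired...
    tokio::task::yield_now().await;      // ...and held across a suspension point
}

// Task B tries to take the same lock with a blocking call, which parks the
// only executor thread, so task A can never resume and release the guard.
async fn task_b(state: Arc<RwLock<u64>>) {
    let _guard = state.read().unwrap();
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(0u64));
    tokio::join!(task_a(state.clone()), task_b(state)); // never completes
}
```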

Therefore, I would suggest that we either remove the `clone` method from SparkSession or replace the RwLock with an asynchronous lock. What do you think?

sjrusso8 commented 2 weeks ago

@wvwwvwwv thanks for this! I've been playing around with adjusting the internals of the client implementation.

I have been trying to switch to references where possible so you don't need to constantly clone the session. I'll change the existing RwLock from parking_lot to tokio::sync::RwLock and incorporate your test.
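
Roughly the same shape as your sketch, but with the asynchronous lock (illustrative code, not the actual client internals): waiting for the lock is itself an await point, so a blocked task yields the executor instead of parking the thread.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

async fn task_a(state: Arc<RwLock<u64>>) {
    let _guard = state.write().await;
    tokio::task::yield_now().await; // holding the guard across awaits is fine
}

async fn task_b(state: Arc<RwLock<u64>>) {
    let _guard = state.read().await; // yields until task_a drops its guard
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(RwLock::new(0u64));
    tokio::join!(task_a(state.clone()), task_b(state)); // completes, no deadlock
}
```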

wvwwvwwv commented 2 weeks ago

Thanks for fixing this so quickly!

One question that might not be related to this issue.

I noticed that SparkSession is boxed when passed to DataFrame, and I guess that's because you wanted to avoid a memcpy. But as far as I understand, DataFrame consumes the SparkSession, so it won't be mem-copied; instead, the DataFrame will be constructed directly on top of the supplied SparkSession if the stack allows (I didn't check at the assembly level, so I'm not 100% sure). Apart from that, heap allocation is usually quite expensive compared to a memcpy (unless certain conditions are met), so I think avoiding Box would be more beneficial in general.
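
A toy sketch of the trade-off I mean (placeholder names and sizes, not the crate's real types):

```rust
// Stand-in for a large session struct.
struct BigSession {
    _buf: [u8; 1024],
}

// Consuming constructor: the session is moved into the DataFrame. No heap
// allocation, though the move itself may still compile to a memcpy.
struct DataFrameByValue {
    session: BigSession,
}

// Boxed variant: the field shrinks to pointer size, at the cost of one heap
// allocation whenever the Box is created.
struct DataFrameBoxed {
    session: Box<BigSession>,
}

fn main() {
    use std::mem::size_of;
    println!("{}", size_of::<DataFrameByValue>()); // 1024
    println!("{}", size_of::<DataFrameBoxed>());   // 8 on 64-bit targets
}
```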

-> Don't get me wrong, I was just curious when I saw Arc<SparkConnectServiceClient> and Box<SparkSession>, because heap allocation was frowned upon at my previous company :-(

Therefore, in my humble opinion, it would be nice to compare the performance of a memcpy vs a Box. What's your opinion?

sjrusso8 commented 2 weeks ago

Yeah, great question!

I was initially modeling some components of this project after DataFusion. Specifically, I was looking at how they create a dataframe from their concept of a SessionContext. That's where I saw they were using Box<> as a means of reducing allocation sizes. The change to Box was made in DataFusion PR #10033.

Running a similar size_of test, the total allocation of a DataFrame comes out to something like this.

With Box<SparkSession>:

With SparkSession:
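
The check itself is roughly this (a sketch; the import paths are assumptions and may differ by crate version, and the figures above came from running it locally):

```rust
use std::mem::size_of;
// Assumed paths; adjust to wherever the crate exposes these types.
use spark_connect_rs::dataframe::DataFrame;
use spark_connect_rs::SparkSession;

fn main() {
    println!("SparkSession:      {} bytes", size_of::<SparkSession>());
    println!("Box<SparkSession>: {} bytes", size_of::<Box<SparkSession>>()); // pointer-sized
    println!("DataFrame:         {} bytes", size_of::<DataFrame>());
}
```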

This project isn't using nearly the same level of futures allocations, so I wasn't running into the stack overflow errors that DataFusion was seeing when they made the switch. I was just taking some pointers from their lessons learned.

I am working on adjusting some aspects of the DataFrame so that it doesn't take ownership of the SparkSession when it is created. I think users would expect to use the SparkSession multiple times in one app without needing to call .clone() every time. The DataFusion SessionContext uses &self for nearly all of its methods and can be reused many times. I originally used clone() just to get things working, but I have started reworking some of the internals now.
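
The rough direction looks like this (a toy sketch with hypothetical names, not the final API): DataFrame-producing methods borrow the session, so one session value can create many DataFrames without any clone() at the call site.

```rust
// Cheap, clonable handle standing in for the underlying gRPC client.
#[derive(Clone)]
struct Client;

struct SparkSession {
    client: Client,
}

struct DataFrame {
    client: Client,
    plan: String,
}

impl SparkSession {
    // Borrows the session; only the cheap client handle is cloned.
    fn range(&self, end: i64) -> DataFrame {
        DataFrame {
            client: self.client.clone(),
            plan: format!("Range(0..{end})"),
        }
    }
}

fn main() {
    let spark = SparkSession { client: Client };
    let df1 = spark.range(5);  // session only borrowed...
    let df2 = spark.range(10); // ...so it can be reused freely
    println!("{} | {}", df1.plan, df2.plan);
}
```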

wvwwvwwv commented 2 weeks ago

Thanks for the answer! SparkSession being 1424 bytes (oh...) totally justifies the heap allocation.

And I also think that using &self makes more sense, given that a single DataFrame or SparkSession is often used multiple times to create different DataFrames.