Closed wvwwvwwv closed 2 weeks ago
@wvwwvwwv thanks for this! I've been playing around with adjusting the internals of the client implementation. I have been trying to switch to references where possible so you don't need to constantly clone the session. I'll change the existing `RwLock` from `parking_lot` to `tokio::sync::RwLock` and incorporate your test.
Thanks for fixing this so quickly!
One question that might not be related to this issue.

I noticed that `SparkSession` is boxed when passed to `DataFrame`, and I guess that's because you wanted to avoid a memcpy. But as far as I understand, `DataFrame` *consumes* `SparkSession`, so `SparkSession` won't be mem-copied; instead, `DataFrame` will be constructed directly on top of the supplied `SparkSession` if the current stack allows (I didn't check this at the assembly level, so I'm not 100% sure). Apart from that, heap allocation is usually (unless certain conditions are met) quite expensive compared to a memcpy, so I think avoiding `Box` would be more beneficial in general.

-> Don't get me wrong, I was just curious when I saw `Arc<SparkConnectServiceClient>` and `Box<SparkSession>`, because the use of heap allocation was frowned upon in my previous company :-(

Therefore, in my humble opinion, it would be nice to compare performance between memcpy and `Box`. What's your opinion?
Yeah, great question!
I initially was modeling some components of this project after DataFusion. Specifically, I was looking at how they create a dataframe from their concept of the `SessionContext`. This is where I saw they were using `Box<>` as a means to reduce the size of allocations. The change to `Box` was made in DataFusion PR #10033.
Doing a similar `size_of` test to see the total allocation of `DataFrame` results in something like this:

With `Box<SparkSession>`: `size_of::<DataFrame>`: 232
With `SparkSession`: `size_of::<DataFrame>`: 1424

This project isn't using nearly the same level of futures allocations, so I wasn't running into a stack overflow error like DataFusion was when they switched. I was just taking some pointers from their lessons learned.
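The effect of boxing on struct size can be sketched in isolation. The types below are hypothetical stand-ins (not the crate's real `SparkSession`/`DataFrame`), with a 1400-byte array playing the role of the session's inline state:

```rust
use std::mem::size_of;

// Stand-in for a large session type holding ~1.4 KB of inline state.
struct BigSession {
    _buf: [u8; 1400],
}

// Storing the session inline makes the containing struct at least as
// large as the session itself.
struct DataFrameInline {
    _session: BigSession,
    _plan: usize,
}

// Boxing replaces the 1400-byte field with a single heap pointer, so the
// containing struct shrinks to pointer size plus its other fields.
struct DataFrameBoxed {
    _session: Box<BigSession>,
    _plan: usize,
}

fn main() {
    println!("inline: {}", size_of::<DataFrameInline>());
    println!("boxed:  {}", size_of::<DataFrameBoxed>());
    assert!(size_of::<DataFrameBoxed>() < size_of::<DataFrameInline>());
}
```

The trade-off mirrors the discussion above: the boxed variant costs a heap allocation per value, but every move, future, and enclosing struct that contains it only shuffles 16 bytes instead of ~1.4 KB.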
I am working on adjusting some aspects of the `DataFrame` so it doesn't take ownership of the `SparkSession` when it is created. I think users would expect to use the `SparkSession` multiple times in one app without needing to call `.clone()` every time they want to use it again. The DataFusion `SessionContext` uses `&self` for nearly all its methods and can be used many times. I used `clone()` originally just as a way to get things working, but I have started reworking some of the internals now.
Thanks for the answer! `SparkSession` being 1424 bytes (oh...) totally justifies heap allocation.

And I also think that using `&self` makes more sense, given that a single `DataFrame` or `SparkSession` is often used multiple times to create different DataFrames.
Hi, thanks a lot for this nice crate!
I'd like to report a deadlock issue when a spark session is cloned and used concurrently.
https://github.com/sjrusso8/spark-connect-rs/pull/46 demonstrates a possible workflow leading to a deadlock.
The gist is: every place `#[allow(clippy::await_holding_lock)]` is used opens the possibility of a deadlock when a spark session is cloned and used concurrently.

-> When a task is suspended while holding the lock, another task will wait for the lock to be released without yielding to the executor.
-> This is a very common "dangerous" asynchronous programming pattern that should always be avoided at all cost.

Therefore, I would suggest that we either remove the `clone` method from `SparkSession`, or replace the `RwLock` with an asynchronous lock. What do you think of this?