Open eredzik opened 5 days ago
Thanks for trying out Sail and reporting the issue!
It seems that all data are loaded before the join happens. Broadcast join can be helpful to keep the small dataset in memory while the big one is read in a streaming fashion. Would you mind sharing some statistics of the datasets (e.g. in terms of number of rows and columns)? We'll see if there is any optimization we can do here.
BTW did you install Sail via pip
or build the package from source?
I've tried to call broadcast on smaller dataset but had some error - assumed it is not supported yet - will try tomorrow fixing possible code error on my part and include broadcast again.
Input dataset was quite wide with at least 200 columns and 6m rows.
I've installed it using pip, version 0.2.0.dev0
Yeah we do not support using broadcast()
to provide SQL hint yet. I mentioned broadcast join since I feel it could be the way we optimize such queries internally.
Could you share the output of df_res.explain()
for both Sail and Spark? I'd like to compare the physical plans to see if there is any interesting difference. (Feel free to mask column names if they are sensitive.)
I have no way to share those plans here unfortunately. One interesting thing I've found trying things out today is that runtime increases exponentially with amount of columns sourced from df1 (big).
I can try creating reproduction example using publicly available datasets in few next days :)
I've tried using sail for local development of spark jobs. But running simple query on dataset that has size of few GBs makes sail slower than spark. When join is not there then query runs within 10secs.
With join I can see in resource monitor that memory is rising whole time (2-3 minutes) and it seems like once all data is in memory then work is executed and released. CPU and disk usage are very low during that time (barely nonexistent)
Query is as such in pseudocode:
Is there some bug or perhaps additional configuration option to make it run in partitions in similar way to how spark runs it?