vepadulano / PyRDF

Python Library for doing ROOT RDataFrame analysis
https://pyrdf.readthedocs.io/en/latest/

Bug in npartitions parameter; spark backend #89

Closed: brianventura closed this issue 3 years ago

brianventura commented 4 years ago

Dear developers, I think I have spotted a bug in the partitioning of the dataframes done in PyRDF. Here is a simple example snippet showing it (I use the most recent PyRDF version on SWAN, i.e. via the zip file and sys.path.insert as explained in the tutorial, but this should reproduce elsewhere too). Config:

import PyRDF
PyRDF.use('spark', {"npartitions": 50})

snippet:

r = PyRDF.RDataFrame(1000)
r = r.Define("nu", "2.")

lala = PyRDF.RDataFrame(20)

print(r.Count().GetValue())     # 1st event loop, runs on r
print(lala.Count().GetValue())  # 2nd event loop, runs on lala
print(r.Sum("nu").GetValue())   # 3rd event loop, runs on r again

Here is my result (screenshot attached in the original issue). As I trigger the event loop three times, the 1st and 3rd for 'r' and the 2nd for 'lala', it seems that the partition number of 'lala' overrides the one of 'r' in the 3rd event loop, and I do not think that is expected.

Thanks a lot for helping, and best regards, Brian

vepadulano commented 4 years ago

Dear @Zeguivert, sorry for the late reply. Yes, the behaviour you are showing is expected:

So in practice this means that you set npartitions once at the beginning of the program, but PyRDF internally checks that this value is less than or equal to the number of clusters of the TTree; otherwise it sets it to the number of clusters.
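
For concreteness, that check amounts to something like the following sketch (the function and variable names here are illustrative, not actual PyRDF internals):

def effective_npartitions(requested, nclusters):
    # A partition cannot span less than one TTree cluster, so a request
    # larger than the cluster count is capped (with a warning).
    if requested > nclusters:
        print("Warning: requested %d partitions but the TTree has only "
              "%d clusters; using %d instead." % (requested, nclusters, nclusters))
        return nclusters
    return requested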

The issue I see with your example is that the warning is not printed; that may be because it gets printed on the workers and not on the driver. This remains to be verified.
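
As an aside, here is a minimal PySpark sketch (independent of PyRDF) of why a message emitted inside a task may never reach the notebook: on a cluster, anything printed in a task ends up in the executor logs rather than in the driver's output.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def work(x):
    # On a cluster, this print goes to the executor's stdout/log,
    # not to the driver, so the user never sees it in the notebook.
    print("warning from inside a task")
    return x * 2

# Only the collected result appears on the driver.
print(sc.parallelize(range(4)).map(work).collect())  # [0, 2, 4, 6]

Cheers, Vincenzo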

vepadulano commented 3 years ago

Dear @Zeguivert, actually the snippet you provide doesn't use TTrees at all, so the npartitions parameter you set at the beginning should not be influenced by the creation of a new dataframe. My previous comment applies when you're dealing with multiple dataframes, each constructed on a different TTree (in that case different TTrees may have different numbers of clusters and we go back to what I wrote in the comment). Just to be sure, I tried your snippet on SWAN and in fact the number of tasks is always the same (screenshot attached in the original issue).

That said, this behaviour has its flaws. In the future, PyRDF will link npartitions to the single RDataFrame rather than to the whole distributed backend instance. This way you won't get your partitions changed under your feet without noticing. For the moment I'm closing this issue.
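
To sketch what that could look like (the per-dataframe npartitions keyword below is hypothetical, not the API of the current release):

import PyRDF
PyRDF.use('spark')

# Each dataframe would carry its own partition count, so creating
# 'lala' could no longer silently change the partitioning of 'r'.
r = PyRDF.RDataFrame(1000, npartitions=50)
lala = PyRDF.RDataFrame(20, npartitions=2)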

Thank you again for your report and patience!