Closed brianventura closed 3 years ago
Dear @Zeguivert , sorry for the late reply. Yes, the behaviour you are showing is expected:
`npartitions` is an attribute of the backend; in fact, you set it by calling `PyRDF.use('spark', {"npartitions": 50})`. In practice this means that you set `npartitions` at the beginning of the program, but PyRDF internally checks that the value is less than or equal to the number of clusters of the TTree; otherwise it resets it to the number of clusters.
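The clamping rule described above can be sketched as a small pure-Python function. This is an illustrative model, not PyRDF's actual internal code; the function name is an assumption.

```python
import warnings

def effective_npartitions(requested: int, nclusters: int) -> int:
    """Hypothetical sketch of the rule described above: the requested
    partition count is capped at the TTree's number of clusters."""
    if requested > nclusters:
        # PyRDF would emit a warning in this situation
        warnings.warn(
            f"npartitions={requested} exceeds the {nclusters} clusters "
            "of the TTree; using the number of clusters instead."
        )
        return nclusters
    return requested
```

For example, requesting 50 partitions on a TTree with only 10 clusters would yield 10 effective partitions, while requesting 8 would be honoured as-is.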
The issue I see with your example is that the warning is not printed; that may be because it gets printed on the workers rather than on the driver. To be verified. Cheers, Vincenzo
Dear @Zeguivert ,
Actually, the snippet you provide doesn't use TTrees at all, so the `npartitions` parameter you set at the beginning should not be influenced by the creation of a new dataframe. My previous comment applies when you're dealing with multiple dataframes, each constructed on a different TTree (in that case, different TTrees may have different numbers of clusters, and we are back to what I wrote above).
Just to be sure, I tried your snippet on SWAN, and in fact the number of tasks is always the same.
That said, this behaviour has its flaws. In the future, PyRDF will link `npartitions` to the single RDataFrame rather than to the whole distributed backend instance. This way your partitions won't be changed under your feet without you noticing. For the moment I'm closing this issue.
Thank you again for your report and patience!
Dear developers, I think I have spotted a bug in the partitioning of the dataframes done in PyRDF. Here is a simple snippet showing it. Config: I use the most recent PyRDF version in SWAN (using the zip file and sys.path.insert as explained in the tutorial, but this should also work elsewhere).
snippet:
Here is my result: I trigger the event loop three times, the 1st and 3rd for 'r' and the 2nd for 'lala', and it seems that the partition number of lala overrides that of r in the 3rd event loop. I do not think that is expected.
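The reported behaviour can be reproduced with a minimal pure-Python model (not real PyRDF code; all names are illustrative): because the partition count lives on a shared backend object, constructing a second dataframe with a different value silently changes the partitioning seen by the first one on its next event loop.

```python
class Backend:
    """Stand-in for the shared backend instance."""
    def __init__(self):
        self.npartitions = None

class DataFrame:
    """Stand-in for a distributed dataframe that writes its npartitions
    onto the shared backend at construction time."""
    def __init__(self, backend: Backend, npartitions: int):
        backend.npartitions = npartitions  # overrides any earlier value
        self.backend = backend

    def count(self) -> int:
        # An event loop runs with whatever the backend holds *now*,
        # not with the value this dataframe was created with.
        return self.backend.npartitions

backend = Backend()
r = DataFrame(backend, npartitions=50)
tasks = [r.count()]            # 1st event loop: 50 tasks, as requested
lala = DataFrame(backend, npartitions=10)
tasks.append(lala.count())     # 2nd event loop: 10 tasks
tasks.append(r.count())        # 3rd event loop: 10 tasks again,
                               # lala's setting leaked onto r
```

Running this yields `tasks == [50, 10, 10]`, mirroring the report: the 3rd event loop on r uses lala's partition count.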
Thanks a lot for helping, and best regards, Brian