@Zeguivert Thank you very much for your feedback, we are working to provide these features as soon as possible. @vepadulano just added Sum (#77) and Count (#74) support for Spark.
Keep in mind that in order to use them you will need to ship a custom version of PyRDF to the Spark executors, as explained in this tutorial, since it is not yet part of Bleeding Edge in SWAN.
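In practice, the shipping step boils down to something like the sketch below. This is not the tutorial verbatim: the clone path is a placeholder, `sc` is assumed to be the SparkContext created by the SWAN Spark connector, and the PyRDF calls follow the usual proxy API.

import shutil
import sys

# Make the locally cloned PyRDF importable in this session;
# "/path/to/custom/PyRDF" is a placeholder for your clone of the repository
sys.path.insert(0, "/path/to/custom/PyRDF")
import PyRDF

# Select the Spark backend, then zip the PyRDF package directory found
# inside the clone and ship it to the executors
PyRDF.use("spark", {"npartitions": 10})
shutil.make_archive("PyRDF", "zip", root_dir="/path/to/custom/PyRDF", base_dir="PyRDF")
sc.addPyFile("./PyRDF.zip")

# Try the newly added operations on a toy dataset
df = PyRDF.RDataFrame(100).Define("x", "2.")
print(df.Count().GetValue())   # expect 100
print(df.Sum("x").GetValue())  # expect 200.0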
Thanks a lot for those two features! I am eagerly waiting for AsNumpy(), as I also use it extensively in my analysis.
Hi @Zeguivert,
The Sum, Count and AsNumpy features are now working on the Spark backend. The easiest way for you to check them out is to log in to a SWAN session and follow the instructions in this tutorial to ship the latest PyRDF version to the Spark executors.
I remain at your disposal should you have any questions.
Dear heroes, thanks a lot for all your work! I will use them wisely... Could you let me know when they will be shipped in the next Bleeding Edge software stack (or LCG?) in SWAN, for simplicity? It is really nice to have the possibility to use them right away, though!
Dear @Zeguivert,
Implementing AsNumpy in PyRDF so that it works with Spark actually required some changes in ROOT. Those changes are already in the nightlies, so (fortunately) you can find them already in Bleeding Edge SWAN. As for PyRDF, we would like to know whether these changes suit your needs. In that case, we will proceed to tag a new release so that it can be uploaded to the LCG releases.
@Zeguivert Did you have a chance to try out the new developments?
Sorry for my late reply. I tried to, but I think I did something wrong because it does not work. Indeed, since there already exists a PyRDF module in LCG_96/Bleeding Edge (I use the latter for AsNumpy()), I am not sure I imported the right version. I put a sys.path.append(PyRDF_path) (where I git cloned it) before doing import PyRDF, but I don't think it worked...
Maybe I should add something to my $PYTHONPATH?
Can you please run the following commands in a SWAN terminal and send us the output?
>>> import PyRDF
>>> print(PyRDF.__file__)
Yes, I found this attribute two seconds before your message ;) and indeed the answer is /cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-centos7-gcc8-opt/lib/python2.7/site-packages/PyRDF/__init__.pyc
I modified the PYTHONPATH but it did nothing...
Hi @Zeguivert, Indeed the demo doesn't take into account that PyRDF is now on LCG, so you actually need to prepend the custom version of PyRDF to sys.path. I'm updating the demo right now for future reference. Meanwhile, I suggest you follow these steps:
1. git clone the PyRDF repository, cd into the PyRDF directory and, while there, type:
python setup.py install --user
2. In your notebook or script:
import sys
sys.path.insert(0, "/path/to/custom/PyRDF")
import PyRDF
This should import the new PyRDF instead of the LCG one. The other steps are the same as in the current demo.
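As a quick sanity check, printing the module path after the import should point inside your clone rather than the cvmfs release:

import sys
sys.path.insert(0, "/path/to/custom/PyRDF")

import PyRDF
# Expected: something under /path/to/custom/PyRDF, not
# /cvmfs/sft-nightlies.cern.ch/...
print(PyRDF.__file__)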
Hi again, thanks for the tip. I indeed tried the insert trick to prepend, but I had not tried step 1. However, by executing it in the SWAN command line I got the following:
copying build/lib/PyRDF/Node.py -> build/bdist.linux-x86_64/egg/PyRDF
error: [Errno 1] Operation not permitted: 'build/bdist.linux-x86_64/egg/PyRDF/Node.py'
and if I redo the command it gives me the same error, not for Node.py but for a different file each time... quite weird...
Okay, so I made a mistake: I passed sys.path.insert the parent folder of PyRDF and not PyRDF itself... now the module imported is the right one. Sorry for this silly mistake of mine. I am trying the notebook right now (however, the first step you describe is still not working because of some permissions; is it necessary?)
Yeah, it's probably not necessary, but there shouldn't be any permission errors nonetheless. The --user option should serve exactly that purpose. Weird indeed.
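For the record, here is a quick way to see where --user installs things (standard Python, nothing PyRDF-specific):

import site
# `python setup.py install --user` targets the per-user site-packages
# directory, typically ~/.local/lib/pythonX.Y/site-packages
print(site.getusersitepackages())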
Ok, so the new demo should look like this: #82. Let us know if you find any more issues with the latest features. By the way, we are very close to supporting Snapshot as well, with #80.
Dear developers, my notebook had one last issue, but not because of the Spark configuration or the module. I had already hit this issue before, so I could debug it quickly: since /eos is not mounted on the Spark clusters, I had to prepend root://eoscompass.cern.ch// to my ROOT file path. The latter also contained other double slashes, which were causing bugs likewise related to accessing the ROOT file. Maybe this could be mentioned somewhere in the documentation(?).
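For reference, my workaround boils down to something like this (the path below is made up, the real one comes from my analysis):

import posixpath

# A hypothetical path with the kind of double slashes I mean
path = "/eos/experiment/compass//some/dir//file.root"

# Collapse the extra slashes and prepend the xrootd endpoint, since /eos
# is not mounted on the Spark executors
url = "root://eoscompass.cern.ch/" + posixpath.normpath(path)
print(url)  # root://eoscompass.cern.ch//eos/experiment/compass/some/dir/file.root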
As for Count and Sum, they seem to work. However, AsNumpy() does not seem to work (I am on Bleeding Edge). The errors are in the attached file.
Any guess? I also tried the method without parameters, but it did not work either.
Hi @Zeguivert,
Yes, you have to provide the full remote path to access files on EOS from the Spark workers.
The errors you get from AsNumpy() have to do with ROOT specifically. Indeed, those are the reason why we first had to modify ROOT in order to make things work with PyRDF, as I commented previously.
I reran a simple tutorial that uses AsNumpy() on my SWAN session with the following configuration:
Software stack: Bleeding Edge Python 3
Platform: CentOS7 (gcc8)
Spark cluster: Cloud Containers (K8s)
I left the other options at their defaults. With this configuration I have an AsNumpy() example working on the Spark clusters.
import sys
# Modify the path to point to your PyRDF directory
sys.path.insert(0, "/eos/user/v/vpadulan/pyrdf-dist-asnumpy/PyRDF")
import PyRDF
import pandas  # optional, only handy for inspecting the results afterwards

# Select the Spark backend and ship the zipped PyRDF to the executors;
# `sc` is the SparkContext provided by the SWAN Spark connector
PyRDF.use("spark", {"npartitions": 10})
sc.addPyFile("./PyRDF.zip")

# Let's create a simple dataframe with ten rows and two columns
df = PyRDF.RDataFrame(10).Define("x", "(int)rdfentry_").Define("y", "1.f/(1.f+rdfentry_)")

# Next, we want to access the data from Python as Numpy arrays. To do so, the
# content of the dataframe is converted using the `AsNumpy` method. The
# returned object is a dictionary with the column names as keys and 1D numpy
# arrays with the content as values.
npy = df.AsNumpy()
print("Read-out of the full RDataFrame:\n{}\n".format(npy))
Output:
Read-out of the full RDataFrame:
{'x': array([4, 0, 6, 2, 8, 5, 3, 9, 7, 1], dtype=int32), 'y': array([0.2 , 1. , 0.14285715, 0.33333334, 0.11111111,
0.16666667, 0.25 , 0.1 , 0.125 , 0.5 ],
dtype=float32)}
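As a side note, since each value in the returned dictionary is a plain 1D numpy array, the pandas import above can be used to get a tabular view (just a convenience, not required by PyRDF):

import pandas

# Build a DataFrame straight from the {column name: numpy array} dictionary
pdf = pandas.DataFrame(npy)
print(pdf)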
Update: I reran the tutorial another time, but this time on Bleeding Edge Python 2. I was able to reproduce your error. I'm currently investigating it; it seems to be related to how Python 2 handles pickling of nested classes.
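For the record, here is a minimal reproducer of that Python 2 limitation, independent of PyRDF:

import pickle

class Outer(object):
    class Inner(object):
        pass

# Python 3 pickles the class by its qualified name ("Outer.Inner") and
# succeeds; Python 2 only looks up "Inner" as a top-level attribute of the
# module and raises pickle.PicklingError
pickle.dumps(Outer.Inner)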
Yes, sorry, I should have said that I used Python 2. I also had an issue yesterday with logging in to SWAN (solved thanks to Piotr), so I could not test your code and answer right away... Anyway, thanks for looking into this Python 2 error!
Dear developers, after a git pull, thanks to #83, it now works with Python 2! It is really a pleasure to see that people are constantly enhancing this (to me) essential module.
Thanks a lot!
@Zeguivert cool! @vepadulano did a great job.
Please let us know if the set of features included so far works for you, so we can go for a new release.
Hi @Zeguivert, Thank you for your kind reply! Keep us updated and have a good day!
@JavierCVilla Every new feature works for me, so I guess you can go for a release if you want. But maybe one could wait to add Snapshot() and maybe Report() before doing it, as you wish; on my side I can keep using the zip-file configuration until then.
Adding to this, the new RMergeableValue family of classes will bring compatibility with potentially all the other RDataFrame features still missing. Closing this issue, as it has been solved.
(Hi!) As a heavy user of this module, it would be great for me to have access to those features, in order of priority (that is why I did not create one issue per feature):
Thanks a lot!