vepadulano / PyRDF

Python library for ROOT RDataFrame analysis
https://pyrdf.readthedocs.io/en/latest/

Add existing features of RDataFrames in distributed PyRDF #76

Closed brianventura closed 4 years ago

brianventura commented 5 years ago

Hi! As a heavy user of this module, it would be great for me to have access to those features, listed by order of priority (which is why I did not create one issue per feature):

Thanks a lot !

JavierCVilla commented 5 years ago

@Zeguivert Thank you very much for your feedback, we are working to provide these features as soon as possible. @vepadulano just added Sum (#77) and Count (#74) support for Spark. Keep in mind that in order to use them you will need to send a custom version of PyRDF to the Spark executors as explained in this tutorial, since it is not yet part of Bleeding Edge in SWAN.
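As a sketch of what "sending a custom version of PyRDF to the executors" involves: the package directory has to be bundled into a zip archive that Spark can ship. The helper below is not part of PyRDF (the function name and paths are illustrative); it only shows one way to build such an archive with the standard library.

```python
import os
import zipfile

def zip_package(pkg_dir, zip_path):
    """Bundle a Python package directory into a zip archive suitable
    for SparkContext.addPyFile(), which ships it to the executors."""
    base = os.path.dirname(os.path.abspath(pkg_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(pkg_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to the package's parent so the
                # archive unpacks as an importable top-level package.
                zf.write(full, os.path.relpath(full, base))

# Hypothetical usage in a SWAN notebook, where sc is the SparkContext:
#   zip_package("/path/to/clone/PyRDF", "PyRDF.zip")
#   sc.addPyFile("./PyRDF.zip")
```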

brianventura commented 5 years ago

Thanks a lot for those two features! I am eagerly waiting for AsNumpy(), as I also use it extensively in my analysis.

vepadulano commented 5 years ago

Hi @Zeguivert, the Sum, Count and AsNumpy features are now working on the Spark backend. The easiest way for you to check them out would be to log in to a SWAN session and follow the instructions in this tutorial to ship the latest PyRDF version to the Spark executors. I remain at your disposal should you have any questions.

brianventura commented 5 years ago

Dear heroes, thanks a lot for all your work! I will use them wisely... Could you let me know when they are shipped in the next bleeding edge software stack (or LCG?) in SWAN, for simplicity? It is really nice to be able to use them right away, though!

vepadulano commented 5 years ago

Dear @Zeguivert, implementing AsNumpy in PyRDF to work with Spark actually required some changes in ROOT. Those changes are already in the nightlies, so fortunately you can already find them in bleeding edge SWAN. As for PyRDF, we would like to know whether these changes suit your needs. In that case, we will proceed to tag a new release so that it can be uploaded to the LCG releases.

JavierCVilla commented 5 years ago

@Zeguivert Did you have any chance to try out the new developments?

brianventura commented 5 years ago

Sorry for my late reply. I tried to, but I think I did something wrong, because it does not work. Since a PyRDF module already exists in LCG_96/bleeding edge (I use the latter for AsNumpy()), I am not sure I imported the right version. I put a sys.path.append() with the path where I git cloned PyRDF before doing import PyRDF, but I don't think it worked...

Maybe I should add something to my $PYTHONPATH?

JavierCVilla commented 5 years ago

Can you please run the following commands in a SWAN terminal and send us the output?

>>> import PyRDF
>>> print(PyRDF.__file__)
brianventura commented 5 years ago

Yes, I found this attribute two seconds before your message ;) and indeed the answer is /cvmfs/sft-nightlies.cern.ch/lcg/views/dev3/Mon/x86_64-centos7-gcc8-opt/lib/python2.7/site-packages/PyRDF/__init__.pyc

I modified the PYTHONPATH, but it did nothing...

vepadulano commented 5 years ago

Hi @Zeguivert, indeed the demo doesn't take into account the fact that PyRDF is now on LCG, so you actually need to prepend the custom version of PyRDF to sys.path. I'm updating the demo right now for future reference. Meanwhile, I suggest you follow these steps:

  1. Once you have git cloned PyRDF, from inside the PyRDF directory type:
    python setup.py install --user
  2. Now go to your notebook and type:
    import sys
    sys.path.insert(0, "/path/to/custom/PyRDF")
    import PyRDF

    This should import the new PyRDF instead of the LCG one. The other steps are the same as in the current demo.
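The reason the prepend matters can be seen in a self-contained sketch: Python resolves imports by scanning sys.path in order, so an entry inserted at index 0 shadows a same-named package later in the path (such as the PyRDF shipped with the LCG release). The package name demo_pkg and the temporary directories below are illustrative stand-ins, not real PyRDF paths.

```python
import os
import sys
import tempfile

def make_pkg(parent, version):
    """Create a minimal package named demo_pkg under the given parent
    directory, with a VERSION marker so we can tell copies apart."""
    pkg = os.path.join(parent, "demo_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("VERSION = %r\n" % version)

lcg_like = tempfile.mkdtemp()  # stands in for the LCG site-packages
custom = tempfile.mkdtemp()    # stands in for the git-cloned copy
make_pkg(lcg_like, "lcg")
make_pkg(custom, "custom")

sys.path.append(lcg_like)
sys.path.insert(0, custom)     # prepend, as in step 2 above
import demo_pkg
print(demo_pkg.VERSION)        # the 'custom' copy wins, not 'lcg'
```

A plain sys.path.append() would leave the LCG-like copy first in the search order, which is exactly the symptom reported above.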

brianventura commented 5 years ago

Hi again, thanks for the tip. I indeed tried the insert trick to prepend, but I had not tried step 1. However, executing it in the SWAN command line gave me the following:

copying build/lib/PyRDF/Node.py -> build/bdist.linux-x86_64/egg/PyRDF
error: [Errno 1] Operation not permitted: 'build/bdist.linux-x86_64/egg/PyRDF/Node.py'

and if I redo the command it gives me the same error, not for Node.py but for different files each time... quite weird...

brianventura commented 5 years ago

Okay, so I made a mistake. I had put in sys.path.insert the parent folder of PyRDF and not PyRDF itself... now the module imported is the right one. Sorry for this silly mistake of mine; I am trying the notebook right now (however, the first step you describe still does not work because of some permissions, but is it necessary?)

vepadulano commented 5 years ago

Yeah, it's probably not necessary, but there shouldn't be any permission errors nonetheless; the --user option should serve exactly that purpose. Weird indeed.

vepadulano commented 5 years ago

Ok, so the new demo should look like this: #82. Let us know if you find any more issues with the latest features. By the way, we are very close to also supporting Snapshot with #80.

brianventura commented 5 years ago

Dear developers, my notebook had one last issue, but not because of the Spark configuration or the module. I had already run into this issue before, so I could debug it quickly: since /eos is not mounted on the Spark clusters, I had to prepend root://eoscompass.cern.ch// to my ROOT file path. The latter also contained other double slashes, which were causing bugs likewise related to accessing the ROOT file. This could perhaps be mentioned somewhere in the documentation (?).
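The fix described above can be sketched as a small helper. This is a hypothetical function, not part of PyRDF or ROOT; the redirector host is the one from my comment, and the double-slash handling reflects what I had to do by hand.

```python
import re

def to_xrootd_url(path, redirector="root://eoscompass.cern.ch"):
    """Build a remote XRootD URL for a file on EOS, since /eos is not
    mounted on the Spark clusters."""
    # Collapse accidental repeated slashes inside the plain filesystem
    # path, which were causing the file-access errors mentioned above.
    clean = re.sub(r"/+", "/", path)
    # The XRootD scheme keeps its own root://host//path double slash.
    return redirector + "//" + clean.lstrip("/")

print(to_xrootd_url("/eos/user//someuser/data//file.root"))
# -> root://eoscompass.cern.ch//eos/user/someuser/data/file.root
```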

As for Count and Sum, they seem to work. However, AsNumpy() does not seem to work (I am on bleeding edge). The errors are in the attached file:

error.txt

Any guess? I also tried the method without parameters, but it did not work either.

vepadulano commented 5 years ago

Hi @Zeguivert, yes, you have to provide the full remote path to access files on EOS from the Spark workers. The errors you get from AsNumpy() are specific to ROOT; indeed, those are the reason why we first had to modify ROOT in order to make things work with PyRDF, as I commented previously. I reran a simple tutorial that uses AsNumpy() in my SWAN session with the following configuration:

Software stack: Bleeding edge Python 3
Platform: CentOS7 (gcc8)
Spark cluster: Cloud Containers (K8s)

I left the other options at their defaults. With this configuration I have an AsNumpy() example working on the Spark clusters.

import sys
# Modify the path to point to your PyRDF directory
sys.path.insert(0, "/eos/user/v/vpadulan/pyrdf-dist-asnumpy/PyRDF")
sys.path

import PyRDF
import pandas

PyRDF.use("spark", {"npartitions": 10})
sc.addPyFile("./PyRDF.zip")

# Let's create a simple dataframe with ten rows and two columns
df = PyRDF.RDataFrame(10).Define("x", "(int)rdfentry_").Define("y", "1.f/(1.f+rdfentry_)")

# Next, we want to access the data from Python as Numpy arrays. To do so, the
# content of the dataframe is converted using the `AsNumpy` method. The
# returned object is a dictionary with the column names as keys and 1D numpy
# arrays with the content as values.
npy = df.AsNumpy()
print("Read-out of the full RDataFrame:\n{}\n".format(npy))

Output:

Read-out of the full RDataFrame:
{'x': array([4, 0, 6, 2, 8, 5, 3, 9, 7, 1], dtype=int32), 'y': array([0.2       , 1.        , 0.14285715, 0.33333334, 0.11111111,
       0.16666667, 0.25      , 0.1       , 0.125     , 0.5       ],
      dtype=float32)}

Update: I reran the tutorial another time, this time on bleeding edge Python 2, and was able to reproduce your error. I'm currently investigating; it seems to be related to how Python 2 handles pickling of nested classes.
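For the record, a minimal illustration of this failure mode (the class names are made up for the example): Python 2's pickle records only a class's __name__, so a class nested inside another cannot be located at unpickling time and pickle.dumps raises PicklingError. Python 3 records the full __qualname__ ("Outer.Inner"), so the same round trip succeeds there.

```python
import pickle

class Outer(object):
    class Inner(object):
        def __init__(self, value):
            self.value = value

# Under Python 2 this dumps/loads round trip would fail with
# PicklingError, because pickle would look for a top-level class
# named "Inner" in the module. Under Python 3 it works, since the
# class is resolved via its qualified name "Outer.Inner".
obj = Outer.Inner(42)
restored = pickle.loads(pickle.dumps(obj))
print(restored.value)  # -> 42
```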

brianventura commented 5 years ago

Yes, sorry, I should have said that I use Python 2. I also had an issue yesterday with logging in to SWAN (solved thanks to Piotr), so I could not test your code and answer right away... Anyway, thanks for looking into this Python 2 error!

brianventura commented 5 years ago

Dear developers, after a git pull, thanks to #83, it now works with Python 2! It is really a pleasure to see that people are constantly enhancing this (to me) essential module.

Thanks a lot !

JavierCVilla commented 5 years ago

@Zeguivert cool! @vepadulano did great work.

Please let us know if the set of features included so far work for you so we can go for a new release.

vepadulano commented 5 years ago

Hi @Zeguivert, Thank you for your kind reply! Keep us updated and have a good day!

brianventura commented 5 years ago

@JavierCVilla every new feature works for me, so I guess you can go for a release if you want. But maybe one could wait to add Snapshot() and perhaps Report() before doing so, as you wish; on my side I can keep using the zip file configuration until then.

vepadulano commented 4 years ago

Adding to this, the new RMergeableValue family of classes will bring compatibility with potentially all other RDataFrame features still missing. Closing this issue, as it has already been solved.