vmware-archive / gpdb-sandbox-tutorials

http://greenplum-db.github.io/gpdb-sandbox-tutorials
BSD 2-Clause "Simplified" License

Pyspark tutorial? #18

Open zhang2jg opened 7 years ago

zhang2jg commented 7 years ago

Hi, I have gone through the tutorial and would like to try PySpark on HDFS. I noticed that PySpark (2.0.x) is pre-installed, but it doesn't support the pre-installed Python (version 2.6.6). To make it work, I need to upgrade to a newer version of Python. I tried using pip and yum to update Python, but these commands are not recognized in the VM.
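
For anyone reproducing this, a quick way to confirm which interpreter the VM is running is to compare `sys.version_info` against a minimum. The sketch below is written to run on Python 2.6 as well as newer versions. The `ASSUMED_MINIMUM` value of 2.7 and the helper name `meets_minimum` are assumptions for illustration, not values taken from the Spark documentation; check the requirement for your Spark release.

```python
import sys

# Assumption: the lowest Python version the installed PySpark accepts.
# (2, 7) is a placeholder; check the documentation for your Spark release.
ASSUMED_MINIMUM = (2, 7)


def meets_minimum(version, minimum=ASSUMED_MINIMUM):
    """Return True if `version` (a tuple of ints) is at least `minimum`."""
    # Compare only as many components as `minimum` specifies.
    return tuple(version[:len(minimum)]) >= tuple(minimum)


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    status = "ok" if meets_minimum(current) else "too old"
    print("Python %d.%d.%d: %s" % (current + (status,)))
```

Running this with the VM's default `python` shows whether the interpreter on the path is the one reported above.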

danielgustafsson commented 7 years ago

@skahler-pivotal do you know if that's doable in the image?

skahler-vmware commented 7 years ago

Not sure if that is doable in the image. PySpark must be coming down as part of Zeppelin, but I imagine the fact that the system is still running RHEL6 is causing an issue.

I'll take this as another flag that the process needs to be upgraded to use RHEL7.

There are a couple of tutorials on upgrading to or adding Python 2.7 that aren't that tough, but yum and some of the system tooling rely on 2.6 at the base, so you've got the potential to mess that up. It is a VM, though, so probably not much is lost if it does go sideways.
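
One way to lower that risk is to install the newer Python alongside the system one and point only PySpark at it, leaving `/usr/bin/python` alone for yum. Spark reads the `PYSPARK_PYTHON` environment variable to choose its Python interpreter. The sketch below builds such an environment; the helper name `pyspark_env` and the path `/usr/local/bin/python2.7` are hypothetical and depend on where the side-by-side install actually lands.

```python
import os


def pyspark_env(interpreter, base_env=None):
    """Return a copy of the environment with PYSPARK_PYTHON set.

    `interpreter` is the path to the alternate Python. The system
    default interpreter is left untouched, so yum keeps using it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = interpreter
    return env


if __name__ == "__main__":
    # Hypothetical install location for the side-by-side Python.
    env = pyspark_env("/usr/local/bin/python2.7")
    print(env["PYSPARK_PYTHON"])
```

The resulting dictionary can be passed as the `env` argument to `subprocess.call` when launching `pyspark`, or the same variable can simply be exported in the shell before starting it.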

zhang2jg commented 7 years ago

Thanks! I figured out that I can use yum when logged in as root. Previously I was logged in as gpadmin and couldn't use yum.

skahler-vmware commented 7 years ago

gpadmin should be able to sudo to root in the VM.