Open kanirudhumar92 opened 7 years ago
Hi @Anirudh-zemoso,
Did you modify the Dockerfile to install Python 3 instead? The reason I ask is because it seems Snakebite is incompatible with Python 3, which is the primary reason that @puckel hasn't added Python 3 support to docker-airflow. At least, that's if I'm reading this right: https://github.com/puckel/docker-airflow/pull/74
Based on this snakebite issue, it looks like it may be a long time until Snakebite is upgraded, if it ever is: https://github.com/spotify/snakebite/issues/62
@puckel, I've been optimistically watching for a solution to the Python 3 issue (since I'd like to base my ETL system on your Airflow distribution). Do you think the only way around the Snakebite dependency would be dropping HDFS support entirely? (I'd be in favor of that since I don't plan to use it, but I can see how this could be a problem for many users.)
Yes, I'm using Python 3. Thanks, I'm not using HDFSHook now.
Does this mean airflow currently does not work with python3? I am failing to initialize the db with airflow installed this way.
@revolucion09 No, it's just the HDFS hook is not working with python3.
Even with this error message, as long as you are not using HDFS with airflow, it should be ok. I've been running Airflow on docker with python3 for a while now, and everything works.
I stumbled on another thing that doesn't work for the same reason. When using the TimeSensor in airflow, it actually imports the HDFSHook as well. This import has a dependency on snakebite, which fails with Python3.
Traceback (most recent call last):
File "/usr/lib/python3.5/site-packages/airflow/models.py", line 264, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/lib/python3.5/imp.py", line 172, in load_source
module = _load(spec)
File "", line 693, in _load
File "", line 673, in _load_unlocked
File "", line 665, in exec_module
File "", line 222, in _call_with_frames_removed
File "/home/airflow/airflow/dags/etl_hadoop_out_daily.py", line 6, in
from airflow.operators.sensors import TimeSensor
File "/usr/lib/python3.5/site-packages/airflow/operators/sensors.py", line 32, in
from airflow.hooks.hdfs_hook import HDFSHook
File "/usr/lib/python3.5/site-packages/airflow/hooks/hdfs_hook.py", line 20, in
from snakebite.client import Client, HAClient, Namenode, AutoConfigClient
File "/usr/lib/python3.5/site-packages/snakebite/client.py", line 1473
baseTime = min(time * (1L << retries), cap);
^
SyntaxError: invalid syntax
I will also report this bug on the official repo...
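For anyone wondering why that one line breaks the import: the `1L` in the traceback is a Python 2 long literal, and Python 3 no longer accepts the `L` suffix. Here is a minimal sketch that reproduces the parse failure without snakebite installed. The helper name `compiles` and the two sample strings are made up for illustration and are not taken from snakebite.

```python
def compiles(source):
    """Return True if `source` parses under the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Same shape as the failing expression in the traceback, with a Python 2 long literal.
py2_style = "base_time = min(2 * (1L << 3), 60)"

# Same expression with the L suffix dropped, which Python 3 accepts.
py3_style = "base_time = min(2 * (1 << 3), 60)"

print("py2-style literal parses:", compiles(py2_style))
print("py3-style literal parses:", compiles(py3_style))
```

Because the failure happens at parse time, the whole module fails to import, even if the offending line would never have been executed.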
This issue is open, but I see a commit https://github.com/puckel/docker-airflow/commit/87db6f5d788c78cf96c1792ebecc75f9e1bc9ea6 that changes the docker-airflow project to use Python 3. It runs, but the logs show an error as @c75 has pointed out. The error happens with the example dags too if you have those switched on. Running initdb causes them to load into the dagbag, which then throws that same error.
Sadly, I see a commit in the snakebite project 3 years ago that adds python3 support, yet subsequent commits have added code that is not python 3 compatible. I've added some notes there.
I am wanting to use s3 and EMR with airflow & python3, wondering if this is a showstopper for that.
In the airflow code, there is exception handling that sets a flag when snakebite is not installed, and as a consequence, code that is not actually using snakebite will not cause errors.
So basically I added a step in my Dockerfile that uninstalls snakebite right after installing airflow and the airflow modules I need. That allowed me to use TimeSensor, and probably the other sensors as well.
So that could work for you also on EMR. In theory, only the code dealing with HDFS actually needs Snakebite so I expect the other code to work under Python3 when snakebite is not installed.
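To illustrate why uninstalling helps, here is a rough sketch of the "set a flag when the package is missing" pattern described above. It is not the actual Airflow source. `guarded_import`, `make_broken_module` and both module names are hypothetical. The sketch assumes the guard only catches `ImportError`, which would explain the behaviour in this thread: a missing package is handled quietly, while an installed package that Python 3 cannot parse raises `SyntaxError` straight past the guard.

```python
import importlib
import os
import sys
import tempfile


def guarded_import(module_name):
    """Return (module, True) on success, or (None, False) if the module is absent.

    Only ImportError is caught; any other error propagates to the caller.
    """
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        return None, False


def make_broken_module(directory, name):
    """Write a module containing a Python 2 long literal into `directory`."""
    with open(os.path.join(directory, name + ".py"), "w") as handle:
        handle.write("base_time = 1L << 3\n")


# Case 1: the package is not installed, so the flag is simply False.
_, missing_flag = guarded_import("hypothetical_missing_hdfs_client")

# Case 2: the package is installed but cannot be parsed by Python 3.
with tempfile.TemporaryDirectory() as tmp:
    make_broken_module(tmp, "hypothetical_py2_only_client")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        guarded_import("hypothetical_py2_only_client")
        broken_outcome = "imported"
    except SyntaxError:
        broken_outcome = "SyntaxError escaped the guard"
    finally:
        sys.path.remove(tmp)

print("flag when package is missing:", missing_flag)
print("result when package is present but Python 2 only:", broken_outcome)
```

Under that assumption, removing the package turns case 2 into case 1, which is why code that never touches HDFS starts importing cleanly.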
Thanks for that suggestion, that gives me some hope before I give up on Python3 for Airflow. In my case I have no real need to interface directly with HDFS, as long as I can run spark submit. Meanwhile it seems like a shame that people go to the trouble of open sourcing a tool only to have people uninstall it because it's not kept modern. I know I could contribute to that project by modernising, but when most of the committers at Spotify have no incentive to keep it that way it seems futile. Maybe they will internally switch to Python3 at some point. Apart from legacy library support, there's really no justification anymore to stick with 2, Python3 is mature these days.
Maybe it's possible to fork it and make a Python3 version, but it's too much work to keep it up to date if Spotify is adding a lot of new code to Snakebite for Python2...
Anyway, I'm pretty sure you will get it to work by simply uninstalling snakebite. If you run into some non-HDFS-related thing that doesn't work, please add it to this thread so we all know. :)
I wonder if it would be possible to swap out the dependency of HDFSHook on snakebite in favor of an alternative HDFS client/wrapper?
This blog post from @wesm makes it sound like the pure C++ libhdfs3, now part of Apache HAWQ, could be a candidate. Perhaps he would know more about whether this is a feasible idea.
http://wesmckinney.com/blog/python-hdfs-interfaces/
There's also an open discussion from 2015 on doing something like that in snakebite at https://github.com/spotify/snakebite/issues/145.
Edit: It looks like someone added a WebHDFSHook with https://github.com/apache/incubator-airflow/pull/604 in 2015, which wraps hdfscli. I'm not sure if this is a complete replacement for the other HDFS hook, as both still seem to be maintained.
The HAWQ developers have advised us against relying on libhdfs3 for any production software. My understanding is that the best option continues to be the JNI-based libhdfs C library.
Okay, so I'm trying to determine if the HdfsCLI Python package is built on libhdfs or something else. Or perhaps the fact that it uses WebHDFS / HttpFS means it doesn't even require a native client locally.
https://hdfscli.readthedocs.io/en/latest/
(This is all pretty new to me.)
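On the "no native client" point: WebHDFS is a REST API served over HTTP, so a client only needs to build URLs and send ordinary HTTP requests. The sketch below only constructs such a URL using the standard library and does not contact any cluster. The host, port, path and user are placeholder assumptions, and `webhdfs_url` is a hypothetical helper, not part of HdfsCLI.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, operation, user=None):
    """Build a WebHDFS REST URL: http://<host>:<port>/webhdfs/v1<path>?op=<OP>."""
    params = [("op", operation)]
    if user is not None:
        # user.name is the query parameter used for simple (non-Kerberos) auth.
        params.append(("user.name", user))
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(params)
    )


# Placeholder values, assumed for illustration only.
url = webhdfs_url("namenode.example.com", 50070, "/data/events", "LISTSTATUS", user="airflow")
print(url)
```

A plain HTTP GET on a URL of this shape is what a directory listing amounts to, so a pure-Python wrapper can work without libhdfs or a JVM on the client side.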
Didn't see it mentioned before so adding this https://pypi.org/project/snakebite-py3/
It seems that the internetarchive project is maintaining a new py3 version of snakebite, any plans of using it?
If you're upgrading airflow, unfreeze all your pip dependencies (then freeze them again). It's possible some of them are still pulling in snakebite.
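To see which installed packages still declare snakebite as a requirement, something like the following sketch may help. It assumes Python 3.8+ for `importlib.metadata`; the helper names (`find_dependents`, `installed_requirements`, and so on) are made up for this example. Note that it also reports packages that list snakebite only under an optional extra, so a hit is a lead to investigate rather than proof that the package gets installed.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name so 'Snake_Bite' and 'snake-bite' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the bare project name from a requirement string like 'pkg>=1.0; extra'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else ""


def find_dependents(target, requirements_by_dist):
    """Return sorted names of distributions whose requirements mention `target`."""
    wanted = normalize(target)
    return sorted(
        dist_name
        for dist_name, requirements in requirements_by_dist.items()
        if any(requirement_name(req) == wanted for req in (requirements or []))
    )


def installed_requirements():
    """Map each installed distribution name to its declared requirement strings."""
    return {
        dist.metadata["Name"]: dist.requires
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    }


if __name__ == "__main__":
    print(find_dependents("snakebite", installed_requirements()))
```

Run it inside the same environment (or container) that Airflow uses; an empty list suggests nothing installed there declares snakebite as a requirement.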
[2017-03-28 14:53:07,932] {models.py:266} ERROR - Failed to import: /usr/local/lib/python3.5/site-packages/airflow/example_dags/example_http_operator.py
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/airflow/models.py", line 263, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/local/lib/python3.5/imp.py", line 172, in load_source
module = _load(spec)
File "", line 693, in _load
File "", line 673, in _load_unlocked
File "", line 673, in exec_module
File "", line 222, in _call_with_frames_removed
File "/usr/local/lib/python3.5/site-packages/airflow/example_dags/example_http_operator.py", line 20, in
from airflow.operators.sensors import HttpSensor
File "/usr/local/lib/python3.5/site-packages/airflow/operators/sensors.py", line 33, in
from airflow.hooks.hdfs_hook import HDFSHook
File "/usr/local/lib/python3.5/site-packages/airflow/hooks/hdfs_hook.py", line 20, in
from snakebite.client import Client, HAClient, Namenode, AutoConfigClient
File "/usr/local/lib/python3.5/site-packages/snakebite/client.py", line 1473
baseTime = min(time * (1L << retries), cap);
^
SyntaxError: invalid syntax