klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo
1.04k stars, 146 forks

Fix distributing typedbytes module when it's not installed as an egg #51

Closed. aripollak closed this 11 years ago

aripollak commented 12 years ago

The existing condition didn't make sense if typedbytes was not installed as an egg, since it would make Hadoop think the typedbytes module was on HDFS. The new method is the same as what's in backends/common.

klbostee commented 12 years ago

I'm afraid I'm not following completely here. When typedbytes is not installed as an egg, the old code will fall back to opts.add('file', modpath), which will make sure the .py file is sent along and thus available on HDFS, right? Not sure what's left to fix then...
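
For readers following along, the fallback described above can be sketched roughly as below. This is only an illustration under assumptions, not dumbo's actual code: the `Options` class is a hypothetical stand-in for dumbo's option container, `add_module_old_style` is a made-up name, and detecting an egg by looking for ".egg" in the path is a guess at what the old condition did. Only the `opts.add('file', modpath)` call and the `libegg` option name come from this thread.

```python
class Options:
    """Hypothetical stand-in for dumbo's option container (not the real class)."""

    def __init__(self):
        self.pairs = []

    def add(self, key, value):
        # Record the option as a (key, value) pair, in insertion order.
        self.pairs.append((key, value))


def add_module_old_style(opts, modpath):
    """Choose how a module gets shipped, based on how it was installed.

    Assumption: an egg install is recognised by '.egg' appearing in the
    module path. Anything else falls back to a plain 'file' option.
    """
    if '.egg' in modpath:
        opts.add('libegg', modpath)
    else:
        opts.add('file', modpath)


# A pip-installed typedbytes is a plain .py file, so it takes the fallback.
opts = Options()
add_module_old_style(opts, '/usr/lib/python2.7/site-packages/typedbytes.py')
```

In this sketch a non-egg install ends up as a `file` option, which is the branch the question above is about.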

aripollak commented 12 years ago

Unfortunately I forgot exactly what was happening, but I definitely tested dumbo after installing typedbytes through pip, and it didn't work with the original code but it worked with this change. I think the problem might have been that opts['file'] would be re-interpreted by the code starting at line 176. The module path didn't start with file://, so it wasn't actually getting passed to streaming as a -file. But if you add it as a libegg, it does get sent along with the job.
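
A rough sketch of the suspected behaviour, for illustration only. It is not the code at line 176; `split_file_options` is a hypothetical helper, and the rule it applies (only paths starting with `file://` are shipped with the job, everything else is assumed to be on HDFS already) is just the guess described above.

```python
LOCAL_PREFIX = 'file://'


def split_file_options(paths):
    """Split 'file' option values into local files to ship and assumed-HDFS paths.

    Assumption: only values starting with 'file://' are treated as local
    and passed to streaming as -file. Every other value is taken to be
    on HDFS already, so nothing is uploaded for it.
    """
    ship_as_file = []
    assumed_on_hdfs = []
    for path in paths:
        if path.startswith(LOCAL_PREFIX):
            # Strip the scheme so streaming would get a plain local path.
            ship_as_file.append(path[len(LOCAL_PREFIX):])
        else:
            assumed_on_hdfs.append(path)
    return ship_as_file, assumed_on_hdfs


# The pip-installed module path has no scheme, so it is never shipped.
local, remote = split_file_options([
    '/usr/lib/python2.7/site-packages/typedbytes.py',
    'file:///tmp/helper.py',
])
```

Under that assumption, the bare typedbytes path lands in the second list and never reaches the job, which would match the failure I saw.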

optimuspaul commented 12 years ago

typedbytes is installed on every node in my cluster, so I don't need or want it distributed to HDFS. Does this patch remove that "feature"?

klbostee commented 11 years ago

Think this is a better solution:

https://github.com/klbostee/dumbo/commit/b67a7b1dfeaa7df8fa98a6cbb550a7fec201fa4d

Thanks!