mozilla / jydoop

Efficient Hadoop Map-Reduce in Python

support for scripts outside the `scripts` directory #48

Open gregglind opened 10 years ago

gregglind commented 10 years ago

(correct me if I am wrong!)

It seems like scripts need to live in the `scripts` directory. In my fantasy world, something like this should run fine:

make hadoop ARGS="/some/fullpath/filter.py outputfile 20130330 20130330"

bcolloran commented 10 years ago

+1. Even just using a full path name to a file that is in your scripts directory causes the jydoop makefile to blow up.

bsmedberg commented 10 years ago

This is in fact pretty hard, because you have to ship the script to the hadoop mappers. Pig has some very fancy logic to recursively figure out the set of python libraries it needs to ship and package them up, and even then it doesn't always work. I'm tempted to say that this is too much hassle and code, and we should just document/require scripts to be in the scripts directory.
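To give a feel for the dependency-discovery problem described above, here is a rough sketch using the standard library's `modulefinder`. It is not what Pig or jydoop actually does; the function name `local_dependencies` and the idea of restricting the search to a fixed list of directories are assumptions made for illustration.

```python
from modulefinder import ModuleFinder


def local_dependencies(script_path, search_dirs):
    """Return {module_name: file_path} for modules the script imports
    that can be found in search_dirs (hypothetical helper, not jydoop API).

    Imports that cannot be resolved inside search_dirs are simply skipped;
    ModuleFinder records them separately in finder.badmodules.
    """
    finder = ModuleFinder(path=list(search_dirs))
    finder.run_script(script_path)  # follows imports recursively

    deps = {}
    for name, module in sorted(finder.modules.items()):
        # Skip the script itself and anything without a source file.
        if name != "__main__" and module.__file__:
            deps[name] = module.__file__
    return deps
```

Even with a helper like this, imports built dynamically at runtime would be missed by static analysis, which is one reason this kind of packaging logic "doesn't always work".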

tarasglek commented 10 years ago

Python eggs might make this feasible, but that still seems like too much work: http://peak.telecommunity.com/DevCenter/PythonEggs

bcolloran commented 10 years ago

Yeah, this is definitely not a huge deal, it just complicates some workflows/data organizations. So at least imo, not something anyone should spend a lot of time on (gregg, if you have stronger feelings I trust you will speak up).

I definitely don't understand all the intricacies, but is this mostly complicated when scripts go looking for modules? Would it be possible for the actual script being run to be loaded from an arbitrary location, but require any custom modules being loaded to live in jydoop/pylib or jydoop/scripts?

That would cover my use case, which is that I have a lot of scripts associated with a lot of projects, including lots of quick one-offs that I want to hang on to. So my scripts folder is getting quite unwieldy, but I have only a few utilities that I've promoted to being 'modules'. And ideally, it'd be nice to have scripts stored with the other files related to the project. But again, for me this is a workflow paper cut, not a real pain point...

thanks guys.
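The split proposed in this comment (job script anywhere, shared modules only in known directories) might look roughly like the sketch below on the local side. It uses CPython 3's `importlib`; jydoop runs on Jython, so the real mechanism would differ, and this does nothing about shipping the files to the mappers. The name `load_job_script` and the module name `jydoop_job` are hypothetical.

```python
import importlib.util
import sys


def load_job_script(script_path, module_dirs):
    """Load a job script from an arbitrary path, while letting it import
    shared modules only from the given directories (e.g. pylib, scripts).

    Hypothetical helper for illustration; not part of jydoop.
    """
    # Make the shared-module directories importable.
    for directory in module_dirs:
        if directory not in sys.path:
            sys.path.insert(0, directory)

    # Load the script itself by file path, wherever it lives.
    spec = importlib.util.spec_from_file_location("jydoop_job", script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Note that the script's own directory is deliberately not added to the import path, so sibling files next to an out-of-tree script would not be importable under this scheme.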

tarasglek commented 10 years ago

bcolloran, it would be doable to make jydoop be more careful about symlinks. This way you could keep your source elsewhere and have jydoop pointed at symlinks in the scripts directory.
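The user-side half of the symlink workflow suggested above could be sketched as follows. This is a minimal sketch, assuming a POSIX filesystem; `link_into_scripts` is a hypothetical helper, not something jydoop provides, and whether jydoop's packaging step follows the link correctly is exactly the part that would still need work.

```python
import os


def link_into_scripts(source_path, scripts_dir):
    """Create (or refresh) a symlink in scripts_dir pointing at a script
    that lives elsewhere. Hypothetical helper, not part of jydoop.
    """
    source_path = os.path.abspath(source_path)
    link_path = os.path.join(scripts_dir, os.path.basename(source_path))

    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link
    elif os.path.exists(link_path):
        # Never clobber a real file that already lives in scripts/.
        raise FileExistsError(link_path)

    os.symlink(source_path, link_path)
    return link_path
```

On the packaging side, `os.path.realpath(link_path)` resolves the link back to the real file, which is what a symlink-aware build step would need to read when bundling the script.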
