klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo
1.04k stars · 146 forks

It's hard to use an output as an input #43

Closed brisssou closed 13 years ago

brisssou commented 13 years ago

Hello! Let me explain: I have two jobs, each creating its own output file. I then want to merge those two files using a third job.

In my first attempt, the first two jobs yielded Python structures (dicts) as values and unicode strings as keys, which turned out to be a dumb idea: I would have had to eval the keys and values in my third job, and I'm not sure anyone would want to do that.

Now I try to output pure strings via encode('utf-8') and some json.dumps. I have strings everywhere now; dumbo cat confirmed it.
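A minimal sketch of the approach described above (the mapper names and payload are hypothetical, not from the actual jobs):

```python
import json

# Hypothetical first-stage mapper: emit plain UTF-8 byte strings as keys
# and JSON-encoded structures as values, instead of raw Python dicts.
def mapper(key, value):
    record = {"field": value, "count": 1}  # example payload
    yield key.encode("utf-8"), json.dumps(record)

# Downstream, a merge-job mapper can recover the structure with json.loads:
def merge_mapper(key, value):
    record = json.loads(value)
    yield key, record["count"]
```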

But if I try to use those two files as input for the third (merge) job, the keys and values come out single-quoted, which is quite a pain when testing my code locally. Of course, I can use dumbo cat out > out.txt to test the merge job on its own, but the code driving all three jobs won't be testable unless run on a real Hadoop cluster.
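If the single-quoted keys and values really are Python reprs of the original strings, one possible workaround for local testing (just a sketch, not part of dumbo's API) is to parse them back with ast.literal_eval:

```python
import ast

# Sketch: assume a line of the locally produced text output looks roughly like
#   'some key'\t'some value'
# i.e. key and value are Python literal reprs separated by a tab.
# ast.literal_eval safely turns such reprs back into Python objects.
def parse_repr_line(line):
    key_repr, value_repr = line.rstrip("\n").split("\t", 1)
    return ast.literal_eval(key_repr), ast.literal_eval(value_repr)
```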

Did I miss something?

Thanks a lot for your help!

klbostee commented 13 years ago

Sounds like using -input code for the local run might fix your problem. (And, iirc, that option should work on Hadoop as well.)
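A sketch of how that chaining might look on the command line. The script names are hypothetical, and the flag spelling follows the comment above; depending on the dumbo version, the relevant options may be -inputformat/-outputformat instead:

```shell
# Run the first two jobs, keeping their output in dumbo's "code" format,
# then feed both outputs to the merge job in that same format.
dumbo start job1.py -input data1.txt -output out1 -outputformat code
dumbo start job2.py -input data2.txt -output out2 -outputformat code
dumbo start merge.py -input out1 -input out2 -inputformat code -output merged
```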

Closing this issue now but feel free to reopen if I misunderstood your problem.