klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo
1.04k stars · 146 forks

It's hard to use an output as an input #43

Closed brisssou closed 13 years ago

brisssou commented 13 years ago

Hello! Let me explain: I have two jobs, each creating its own output file. I then want to merge those two files using a third job.

In my first attempt, the first two jobs yielded Python structures (dicts) as values and unicode strings as keys, which turned out to be a dumb idea: I would have had to eval the keys and values in my third job, and I'm not sure anyone would want to do that.

Now I try to output pure strings via encode('utf-8') and some json.dumps. I have strings everywhere now; dumbo cat confirmed it.
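A minimal sketch of the approach described above (the mapper names and payload are hypothetical, not from the actual jobs):

```python
import json

# Hypothetical first-stage mapper: emit plain UTF-8 byte strings as keys
# and JSON-encoded structures as values, instead of raw Python dicts.
def mapper(key, value):
    record = {"field": value, "count": 1}  # example payload
    yield key.encode("utf-8"), json.dumps(record)

# Downstream, a merge-job mapper can recover the structure with json.loads:
def merge_mapper(key, value):
    record = json.loads(value)
    yield key, record["count"]
```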

But if I try to use those two files as input for the third (merge) job, the keys and values come out single-quoted, which is quite a pain when testing my code locally. Of course, I can use dumbo cat out > out.txt to test the merge job on its own, but the code driving all three jobs won't be testable unless run on a real Hadoop cluster.
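If the single-quoted keys and values really are Python reprs of the original strings, one possible workaround for local testing (just a sketch, not part of dumbo's API) is to parse them back with ast.literal_eval:

```python
import ast

# Sketch: assume a line of the locally produced text output looks roughly like
#   'some key'\t'some value'
# i.e. key and value are Python literal reprs separated by a tab.
# ast.literal_eval safely turns such reprs back into Python objects.
def parse_repr_line(line):
    key_repr, value_repr = line.rstrip("\n").split("\t", 1)
    return ast.literal_eval(key_repr), ast.literal_eval(value_repr)
```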

Did I miss something?

Thanks a lot for your help!

klbostee commented 13 years ago

Sounds like using -input code for the local run might fix your problem. (And, iirc, that option should work on Hadoop as well.)
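A sketch of how that chaining might look on the command line. The script names are hypothetical, and the flag spelling follows the comment above; depending on the dumbo version, the relevant options may be -inputformat/-outputformat instead:

```shell
# Run the first two jobs, keeping their output in dumbo's "code" format,
# then feed both outputs to the merge job in that same format.
dumbo start job1.py -input data1.txt -output out1 -outputformat code
dumbo start job2.py -input data2.txt -output out2 -outputformat code
dumbo start merge.py -input out1 -input out2 -inputformat code -output merged
```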

Closing this issue now but feel free to reopen if I misunderstood your problem.