klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo

Add cleanup functionality #60

Closed mshevelev closed 12 years ago

mshevelev commented 12 years ago

Sometimes it is useful to output some additional records after all lines have been processed. For example, you maintain some data structure and update it on every call of the __call__() method. After the mapper/reducer/combiner finishes processing its input, you can iterate through the structure and output some additional records.

As far as I know, this feature is supported by Java Hadoop. Nothing in streaming prevents it, and it can easily be implemented in dumbo.
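A minimal sketch of the idea, independent of dumbo's internals: the driver below (`run_with_cleanup`, a hypothetical name, not dumbo's actual code) feeds every record to a callable and, once the input is exhausted, emits whatever an optional `cleanup()` method yields. `CountingMapper` is likewise only an illustration.

```python
def run_with_cleanup(func, records):
    """Hypothetical driver: call func on each record, then its cleanup()."""
    for key, value in records:
        for output in func(key, value):
            yield output
    # After the last record, emit any extra records from cleanup(), if defined.
    cleanup = getattr(func, "cleanup", None)
    if callable(cleanup):
        for output in cleanup():
            yield output


class CountingMapper:
    """Illustrative mapper that tracks how many lines it has seen."""

    def __init__(self):
        self.nlines = 0

    def __call__(self, key, line):
        self.nlines += 1
        for word in line.split():
            yield word, 1

    def cleanup(self):
        # Runs once, after all input lines have been processed.
        yield "Total lines", self.nlines


results = list(run_with_cleanup(CountingMapper(), [(0, "a b"), (1, "c")]))
```

Because the check uses `getattr`, plain functions without a `cleanup` attribute would pass through such a driver unchanged.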

mshevelev commented 12 years ago

Example usage:

import dumbo

def mapper(_, line):
    for word in line.strip().split():
        yield word, 1

class Reducer:

    def __init__(self):
        self.nwords = 0

    def __call__(self, word, values):
        s = sum(values)
        self.nwords += s
        yield word, s

    def cleanup(self):
        yield 'Total words', self.nwords

dumbo.run(mapper, Reducer, combiner=dumbo.sumreducer)

kzhai commented 12 years ago

Is it possible to add this example to the short tutorial? It took me a while to find it. Thanks.

klbostee commented 11 years ago

I just added it to the "further reading" section at the end. Might be possible to integrate it more somehow I guess, but now it should at least be easier to find...

scottkwong commented 10 years ago

Java Hadoop also supports setup methods that run before the mapper/reducer starts processing lines (e.g., to open and read a file). Should this be done in the __init__ method or in a separate method?

a4tunado commented 10 years ago

You should implement a configure(self) method for initialization routines.
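A rough sketch of how the two hooks might fit together, assuming configure() is invoked once before the first record and cleanup() once after the last. The class name, the stopword set, and the manual calls at the bottom are illustrative stand-ins for what the framework would do in a real job.

```python
class FilteringMapper:
    """Illustrative mapper with a setup hook and a cleanup hook."""

    def configure(self):
        # Setup step: build state once, before any record is processed.
        # A real job might open and read a file here instead.
        self.stopwords = {"the", "a"}
        self.skipped = 0

    def __call__(self, key, line):
        for word in line.strip().split():
            if word in self.stopwords:
                self.skipped += 1
            else:
                yield word, 1

    def cleanup(self):
        # Emitted once, after all input has been processed.
        yield "Skipped words", self.skipped


# Manual lifecycle, standing in for the framework (assumption for this sketch).
mapper = FilteringMapper()
mapper.configure()
output = []
for key, line in enumerate(["the cat", "a dog sat"]):
    output.extend(mapper(key, line))
output.extend(mapper.cleanup())
```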