klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo

-inputformat not handled in local mode #42

Closed jmesnil closed 13 years ago

jmesnil commented 13 years ago

hi,

I want to run Dumbo with a specific input format (to read from Avro files). It seems Dumbo does not use the input format specified by '-inputformat' when it is run locally (without specifying '-hadoop'). Instead it uses its default input format.

To check that, I specified an unknown class with '-inputformat foo.bar.UnknownClass'. The job fails on Hadoop but passes in local mode.

Hadoop mode:

$ dumbo start cat.py \
    -input word-count.avro \
    -output tmp \
    -libjar avro-1.4.1.jar \
    -libjar avro-utils-1.5.3-SNAPSHOT.jar \
    -inputformat foo.bar.UnknownClass \
    -python /home/sites/sci-env/0.0.5/bin/python \
    -hadoop /usr/lib/hadoop
...
-inputformat : class not found : foo.bar.UnknownClass
Streaming Command Failed!

Local mode:

$ dumbo start cat.py \
    -input word-count.avro \
    -output tmp \
    -libjar avro-1.4.1.jar \
    -libjar avro-utils-1.5.3-SNAPSHOT.jar \
    -inputformat foo.bar.UnknownClass \
    -python /home/sites/sci-env/0.0.5/bin/python
INFO: buffersize = 168960

=> No error: tmp was created, but it contains the raw content of the binary Avro file, because the file was read as text...

Is it a limitation of Dumbo that '-inputformat' works only in Hadoop mode, or is it a bug?

thanks, jeff

klbostee commented 13 years ago

It's a limitation. Dumbo's local mode relies only on UNIX pipes and doesn't use Hadoop in any way, so specifying a Java class as the input format for a local run simply cannot work. If you want to test Hadoop helper classes locally, you have to install a Hadoop build locally and configure it to run in local mode (which is the default configuration).
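
To illustrate why a Java input format has nothing to hook into, here is a minimal Python sketch of a pipe-style local run. It is not Dumbo's actual implementation: `run_local`, `demo`, `IDENTITY` and `SORT` are hypothetical names, and the mapper, sort step and reducer are stand-in child processes. It only shows the general shape `input | mapper | sort | reducer`, in which the input file is fed to the mapper as raw bytes and any `inputformat` value is never consulted because no JVM is started.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-ins for a streaming mapper, the sort step, and a reducer.
# Each is just a child process that reads stdin and writes stdout.
IDENTITY = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
SORT = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.writelines(sorted(sys.stdin.buffer))"]


def run_local(input_path, mapper_cmd=IDENTITY, reducer_cmd=IDENTITY,
              inputformat=None):
    """Emulate `input | mapper | sort | reducer` and return the output bytes.

    `inputformat` is accepted only to show that nothing consumes it: no JVM
    is started, so no Java class by that name is ever looked up.
    """
    # The input file is handed to the mapper as-is, byte for byte.
    with open(input_path, "rb") as src:
        mapped = subprocess.run(mapper_cmd, stdin=src,
                                stdout=subprocess.PIPE, check=True).stdout
    # Sort the mapper output line by line, as a shuffle phase would.
    shuffled = subprocess.run(SORT, input=mapped,
                              stdout=subprocess.PIPE, check=True).stdout
    # Feed the sorted lines to the reducer and collect its output.
    return subprocess.run(reducer_cmd, input=shuffled,
                          stdout=subprocess.PIPE, check=True).stdout


def demo():
    # Write a small tab-separated input file, then run it through the pipeline
    # with a class name that does not exist anywhere.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
        f.write(b"b\t2\na\t1\n")
        path = f.name
    try:
        return run_local(path, inputformat="foo.bar.UnknownClass")
    finally:
        os.remove(path)
```

In this sketch, calling `demo()` succeeds even though `foo.bar.UnknownClass` does not exist, which mirrors the behaviour reported above: a binary file passed through such a pipeline is simply treated as newline-separated bytes.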