mozilla / jydoop

Efficient Hadoop Map-Reduce in Python
Other
31 stars 19 forks source link

user profile data dump script #50

Closed oyiptong closed 10 years ago

oyiptong commented 10 years ago

in addition to the jython userprofile.py script, PythonWrapper$ContextWrapper has been modified to expose the hadoop job's configuration

bsmedberg commented 10 years ago

What does this script actually do? What dataset does it operate on? Maybe you should include a docstring explaining it?

oyiptong commented 10 years ago

I added a description. Does that work?

mreid-moz commented 10 years ago

The 'map' function doing another lookup of the key in the HBase table seems like it will be a performance problems for larger data sets.

We can expose the row timestamp to the 'map' function along with key + value instead, then we can do everything in a single pass.

I've got some code that allows you to set a configuration value to expose the timestamps, which would let you write a map function like: def map(key, value, timestamp, context): ...

Will that work for your use case? If so, I'll push another branch you can test out.

mreid-moz commented 10 years ago

The timestamp-enabled code PR is at https://github.com/mozilla/jydoop/pull/51

oyiptong commented 10 years ago

closing this PR, the code was merged in https://github.com/mreid-moz/jydoop/pull/1