oyiptong closed 10 years ago
What does this script actually do? What dataset does it operate on? Maybe you should include a docstring explaining it?
I added a description. Does that work?
Having the 'map' function do another lookup of the key in the HBase table looks like it will be a performance problem for larger data sets.
We can expose the row timestamp to the 'map' function along with key + value instead, then we can do everything in a single pass.
I've got some code that allows you to set a configuration value to expose the timestamps, which would let you write a map function like:
def map(key, value, timestamp, context): ...
Will that work for your use case? If so, I'll push another branch you can test out.
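For illustration, here is a minimal sketch of a single-pass map function using the proposed signature. The cutoff constant, the filtering logic, and the `FakeContext` stand-in are hypothetical and only there for local testing; the sketch also assumes the timestamp arrives as milliseconds since the epoch and that the context offers a `write(key, value)` method.

```python
# Hypothetical cutoff in milliseconds since the epoch (assumed unit).
CUTOFF_MS = 1380000000000


def map(key, value, timestamp, context):
    # The row timestamp is passed in with the key and value, so the
    # function never has to go back to the HBase table for it.
    if timestamp >= CUTOFF_MS:
        context.write(key, timestamp)


class FakeContext(object):
    """Hypothetical stand-in for the job context, for local testing only."""

    def __init__(self):
        self.emitted = []

    def write(self, key, value):
        # Record what the map function emits instead of sending it to Hadoop.
        self.emitted.append((key, value))


ctx = FakeContext()
map("row-a", "{}", CUTOFF_MS + 1, ctx)  # recent row: emitted
map("row-b", "{}", 1, ctx)              # old row: skipped
```

Everything happens in one pass over the scan results, which is the point of exposing the timestamp.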
The timestamp-enabled code PR is at https://github.com/mozilla/jydoop/pull/51
Closing this PR; the code was merged in https://github.com/mreid-moz/jydoop/pull/1
In addition to the Jython userprofile.py script, PythonWrapper$ContextWrapper has been modified to expose the Hadoop job's configuration.