klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo
1.04k stars 146 forks

dumbo cat can be slow in case of many part files #67

Closed klbostee closed 11 years ago

klbostee commented 11 years ago

The fix for https://github.com/klbostee/dumbo/issues/1 has the drawback that looping over all part files and catting them one after the other can be quite slow when there are many part files. Adding an option that re-enables the old behaviour of catting a whole directory in one go would let people speed up the cat. This would of course cause problems again when e.g. _logs subdirs are present, but as long as the option is off by default this shouldn't be a problem...
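To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not dumbo's actual code: the function name `cat_targets`, the `combined` flag, and the assumption that output files are named with a `part-` prefix are all hypothetical and only serve as illustration. Each returned path stands for one separate cat invocation, so the length of the list is roughly what determines how slow the cat gets.

```python
import posixpath


def cat_targets(directory, entries, combined=False):
    """Return the paths to cat, one entry per cat invocation.

    `entries` is the list of names found directly under `directory`.
    `combined` is a hypothetical option name, off by default.
    """
    if combined:
        # Old behaviour: a single invocation over the whole directory.
        # Fast, but anything else in the directory (e.g. _logs) is hit too.
        return [directory]
    # Current behaviour: one invocation per part file, skipping the rest.
    # Assumes part files are named with a "part-" prefix.
    return [
        posixpath.join(directory, name)
        for name in sorted(entries)
        if name.startswith("part-")
    ]
```

With the flag off, a directory with N part files yields N invocations and `_logs` is skipped; with the flag on, there is a single invocation, but nothing in the directory is filtered out.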