twitter / hadoop-lzo

Refactored version of code.google.com/hadoop-gpl-compression for hadoop 0.20
GNU General Public License v3.0
545 stars 328 forks source link

Make LzoTextInputFormat#listStatus thread safe for concurrent call #120

Open xq262144 opened 8 years ago

xq262144 commented 8 years ago

DeprecatedLzoTextInputFormat and LzoTextInputFormat do not thread-safe. Use ConcurrentHashMap instead of HashMap.

sjlee commented 7 years ago

Thanks @xq262144 for your contribution.

While I understand the desire to make these classes thread safe, I don't think in general that there is no guarantee or expectation that an InputFormat or OutputFormat class should be thread safe. What is the case where you're running into thread-safety issues? Can you not make it work by simply instantiating a new instance for each thread?

xq262144 commented 7 years ago

@sjlee Thank you for your response.

I found this thread-safe issue while trying to integrate DeprecatedLzoTextInputFormat with Hive, and I got my Hive job blocking in HashMap.put method call on a corrupted HashMap data.

Then I analyzed the call stack and found Hive has some sort of input format caching mechanism in here https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L250, thus it requires input formats to be thread-safe.

To solve this, it's ok to add an exception for lzo input formats in Hive or make lzo input formats thread-safe.

I choose to make lzo input formats thread-safe while comparing to modifying Hive. Because it's a rather small code base and simpler to upgrade in a production environment.

And since hadoop-lzo is so widely used, I suppose making it thread-safe is not a bad idea. :)

sjlee commented 7 years ago

Thanks for the explanation. It is iffy that Hive caches input format instances and lets them be used by multiple concurrent threads. Hadoop-lzo might not be the only input format types that might have issues.

Have you tried opening a discussion with the Hive community? While I'm not necessarily against making this change (seems fairly low risk), I'm more curious to what the Hive community has to say.

CLAassistant commented 5 years ago

CLA assistant check
All committers have signed the CLA.