Open xq262144 opened 8 years ago
Thanks @xq262144 for your contribution.
While I understand the desire to make these classes thread safe, I don't think in general that there is no guarantee or expectation that an InputFormat or OutputFormat class should be thread safe. What is the case where you're running into thread-safety issues? Can you not make it work by simply instantiating a new instance for each thread?
@sjlee Thank you for your response.
I found this thread-safe issue while trying to integrate DeprecatedLzoTextInputFormat
with Hive, and I got my Hive job blocking in HashMap.put
method call on a corrupted HashMap data.
Then I analyzed the call stack and found Hive has some sort of input format caching mechanism in here https://github.com/apache/hive/blob/41fbe7bb7d4ad1eb0510a08df22db59e7a81c245/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L250, thus it requires input formats to be thread-safe.
To solve this, it's ok to add an exception for lzo input formats in Hive or make lzo input formats thread-safe.
I choose to make lzo input formats thread-safe while comparing to modifying Hive. Because it's a rather small code base and simpler to upgrade in a production environment.
And since hadoop-lzo
is so widely used, I suppose making it thread-safe is not a bad idea. :)
Thanks for the explanation. It is iffy that Hive caches input format instances and lets them be used by multiple concurrent threads. Hadoop-lzo might not be the only input format types that might have issues.
Have you tried opening a discussion with the Hive community? While I'm not necessarily against making this change (seems fairly low risk), I'm more curious to what the Hive community has to say.
DeprecatedLzoTextInputFormat
andLzoTextInputFormat
do not thread-safe. UseConcurrentHashMap
instead ofHashMap
.