jghoman / haivvreo

Hive + Avro. A SerDe for working with Avro in Hive
Apache License 2.0

Select from Avro-backed Hive table fails when schema is in HDFS #6

Closed tomwhite closed 13 years ago

tomwhite commented 13 years ago

Hi Jakob,

There's an interesting error case when the Avro schema is read from the same HDFS filesystem that the data is being read from or written to. Haivvreo closes the HDFS FileSystem object after reading the schema, but since this is a cached object (held in FileSystem's static cache), the subsequent read or write on HDFS fails because the shared DistributedFileSystem instance is now closed.

I think the simplest solution is not to close the HDFS filesystem object. It will be automatically closed when the task VM terminates.
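The failure mode can be sketched with a toy model of the cache (hypothetical `ToyFileSystem` class standing in for `org.apache.hadoop.fs.FileSystem`, which by default hands out one shared, cached instance per filesystem URI):

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for org.apache.hadoop.fs.FileSystem and its static cache.
// get() returns one shared instance per URI authority, just as Hadoop's
// FileSystem.get() does when fs.hdfs.impl.disable.cache is false (the default).
public class ToyFileSystem implements Closeable {
    static final Map<String, ToyFileSystem> CACHE = new HashMap<>();
    private boolean closed = false;

    static ToyFileSystem get(URI uri) {
        return CACHE.computeIfAbsent(uri.getAuthority(), k -> new ToyFileSystem());
    }

    String read(String path) throws IOException {
        if (closed) {
            throw new IOException("Filesystem closed"); // what the Hive task sees
        }
        return "contents of " + path;
    }

    @Override
    public void close() {
        closed = true;
        CACHE.values().remove(this); // the real close() also evicts the cache entry
    }

    // Reproduces the bug: the schema reader and the data reader share one
    // cached instance, so closing after the schema read breaks the data read.
    public static boolean reproduces() {
        URI hdfs = URI.create("hdfs://namenode:8020/");
        ToyFileSystem dataFs = ToyFileSystem.get(hdfs);   // held for table I/O
        ToyFileSystem schemaFs = ToyFileSystem.get(hdfs); // same object back
        try {
            schemaFs.read("/schemas/record.avsc");
            schemaFs.close();                             // closes dataFs too
            dataFs.read("/warehouse/t/part-00000.avro");
            return false;                                 // no error: bug not reproduced
        } catch (IOException e) {
            return true;                                  // "Filesystem closed"
        }
    }
}
```

Leaving the shared instance open (and letting the JVM shutdown clean it up, as proposed above) avoids the failure, since no other holder of the cached object is affected.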

Thoughts?

Tom

jghoman commented 13 years ago

Hmmm. This is interesting. We use this feature all the time; I wonder why we've not run into it. What's the environment you saw this in? I agree with the fix.

tomwhite commented 13 years ago

Thanks Jakob. Could it be because you have fs.hdfs.impl.disable.cache set to true? See https://issues.apache.org/jira/browse/HADOOP-6231. Or perhaps the HDFS URI is different in some way (e.g. different host address format).

I thought about creating the FileSystem to read the schema with fs.hdfs.impl.disable.cache set to true, but that doesn't necessarily help if the one used to read/write data to/from HDFS is not also created in the same way. This is because the FileSystem for a given key is always removed from the cache in the close method, so closing the schema-reading FS might still inadvertently remove another HDFS FS instance.
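The eviction behaviour described above can be sketched the same way (toy `CachedFs` class, hypothetical; in Hadoop the eviction happens inside `FileSystem.close()`):

```java
import java.io.Closeable;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy model of the eviction concern: close() always removes the instance's
// key from the shared cache, so a later get() for the same URI silently
// creates a fresh instance while any older reference holds a dead one.
public class CachedFs implements Closeable {
    static final Map<String, CachedFs> CACHE = new HashMap<>();
    boolean closed = false;

    static CachedFs get(URI uri) {
        return CACHE.computeIfAbsent(uri.getAuthority(), k -> new CachedFs());
    }

    @Override
    public void close() {
        closed = true;
        CACHE.values().remove(this); // eviction on close, per the HADOOP-6231 discussion
    }

    public static boolean demonstratesEviction() {
        URI hdfs = URI.create("hdfs://namenode:8020/");
        CachedFs first = CachedFs.get(hdfs);
        first.close();                        // evicts the cache key
        CachedFs second = CachedFs.get(hdfs); // brand-new instance, not the old one
        return first != second && first.closed && !second.closed;
    }
}
```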

So, I think this is the right fix. Thanks for merging it.