kite-sdk / kite

Kite SDK
http://kitesdk.org/docs/current/
Apache License 2.0
394 stars 263 forks source link

KITE-1023: Fix PartitionView URI escaping. #387

Closed rdblue closed 9 years ago

rdblue commented 9 years ago

This adds test for URIs with escaped characters and fixes handling in the FS PartitionView implementation. The implementation was accidentally unescaping URIs by using URI.create, which expects escaped characters, along with URI#getPath, which will remove escapes.

This also fixes the URI returned by getLocation. This previously used Path#getUri, but that method returns the Path's internal URI that is double-escaped. Returning a URI created from Path#toString is the correct behavior because it matches the Path passed in. Similarly, the merge and replace methods in the FS dataset implementation that rely on Paths created from those locations have been updated to correctly construct a Path from a URI by converting the URI to a String rather than setting the internal URI directly.

rdblue commented 9 years ago

@tomwhite, can you take a look at this fix? Something strange is going on with Hadoop's Path. If there is a path in the file system with an escaped directory name, a%2Fb, you can access it if you instantiate the path from a String. But if you create a URI and then a Path from that, you end up with a double-escaped path.

fs.exists(new Path("/tmp/a%2Fb")); // => true
Path badPath = new Path(URI.create("/tmp/a%2Fb"));
fs.exists(badPath); // => false
badPath.toString(); // => "/tmp/a/b"

This means that we can't call Path#getUri inside of getLocation. We have to get the path as a string and then create a URI from it. Similarly, we can't call new Path(location) without introducing a bug. This looks like a serious problem for the API because users will expect to be able to create paths from reported locations and URI is the right thing to return (I thought), but Path will mange that URI if there are escaped characters.

mkwhitacre commented 9 years ago

Informal validation, I built your branch locally then tried out my unit tests on that version and I was able to get the builds to pass successfully whereas previously they were failing with the issues mentioned on the JIRA.

rdblue commented 9 years ago

Thanks @mkwhitacre!

tomwhite commented 9 years ago

Hadoop Path can be tricky. Can we document the places where we return a URI to explain the limitation? Also, providing a utility function for converting to a Path would be helpful.

noslowerdna commented 9 years ago

@rdblue Will this be fixed in Kite 1.2.0?

rdblue commented 9 years ago

@noslowerdna, yes. I'm planning on getting it in.

rdblue commented 9 years ago

Okay, I located the methods where we may return a URI that represents a Hadoop Path and added a warning to the javadoc. I also made sure that there are no cases where we use those methods incorrectly. I also rebased. I'll merge this when the tests succeed.

noslowerdna commented 9 years ago

Great, thanks!