samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

IOUtil.getPath fails when the input has spaces. #1034

Open lbergelson opened 6 years ago

lbergelson commented 6 years ago

IOUtil.getPath("some file") explodes:

Caused by: java.net.URISyntaxException: Illegal character in path at index 4: some file
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3063)
    at java.net.URI.<init>(URI.java:588)
    at java.net.URI.create(URI.java:850)
    ... 25 more

This is due to converting it to a URI first. It should not explode.

nh13 commented 6 years ago

I also think we should be using the URI constructor and not the URI.create method after reading the javadoc for the latter.

jb-adams commented 5 years ago

I had a look at this yesterday, I think we can get rid of the error if we check that the passed string is a valid URI (has scheme, no illegal characters, etc.) before invoking URI.create. If not, then the string is treated as a local filesystem path.

Thoughts?

cmnbroad commented 5 years ago

@jb-adams Hi Jeremy - I think you're on the right track. We have developed a more generalized solution for this in a separate repo, that we use in GATK, in the form of an interface and a concrete type that handles these conversions. It normalizes all inputs to a URI internally, and also has some other advantages. Integrating that into htsjdk is a much a larger task, but you might look at the implementation for inspiration, and there are also a boatload of test cases in there.

jb-adams commented 5 years ago

Hi @cmnbroad, looks great, I think it would be good to implement something similar in htsjdk. I'm pretty new to developing here, so I was curious, will htsjdk-next-beta replace htsjdk? Which repo are current versions of GATK and Picard using?

cmnbroad commented 5 years ago

@jb-adams GATK and Picard both depend on this repo; htsjdk-next doesn't (yet) have a sufficient critical mass of functionality to warrant switching. The URI classes were designed with the idea that they would migrate their way here at some point, but I'd suggest getting maintainer buy-in before spending much time on it. @lbergelson or @tfenne any thoughts on this ? GATK has its own copy of them, which could be discarded if they lived in htsjdk, and I think it would benefit Picard and possibly other consumers as well.

jb-adams commented 5 years ago

@cmnbroad sounds good, I'll wait till we get the go ahead from maintainers, then I can work on this.