samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

SeekableStream and TabixReader cannot read from ftp #797

Open dariober opened 7 years ago

dariober commented 7 years ago

Subject of the issue

I don't know if it's just me, but I cannot read a remote file on an FTP server using TabixReader or SeekableStream: both fail with a SocketTimeoutException.

Your environment

Steps to reproduce

TabixReader tabixReader = new TabixReader("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/input_call_sets/ALL.wex.union_illumina_wcmc_bcm_bc_bi.20110521.snps.exome.sites.vcf.gz");
ISeekableStreamFactory ssf = SeekableStreamFactory.getInstance();
ssf.getStreamFor(new URL("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/input_call_sets/ALL.wex.union_illumina_wcmc_bcm_bc_bi.20110521.snps.exome.sites.vcf.gz"));

Expected behaviour

A valid TabixReader or SeekableStream object.

Actual behaviour

A SocketTimeoutException

Full stack trace:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at htsjdk.samtools.util.ftp.FTPReply.<init>(FTPReply.java:37)
    at htsjdk.samtools.util.ftp.FTPClient.executeCommand(FTPClient.java:82)
    at htsjdk.samtools.util.ftp.FTPClient.login(FTPClient.java:91)
    at htsjdk.samtools.util.ftp.FTPUtils.connect(FTPUtils.java:115)
    at htsjdk.samtools.seekablestream.SeekableFTPStreamHelper.<init>(SeekableFTPStreamHelper.java:49)
    at htsjdk.samtools.seekablestream.SeekableFTPStream.<init>(SeekableFTPStream.java:39)
    at htsjdk.samtools.seekablestream.SeekableFTPStream.<init>(SeekableFTPStream.java:35)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:78)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:68)
    at tabix.testFTP(tabix.java:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

I tried a couple of files on different FTP sites, with the same result. If I instead point both calls at a file served over HTTP, they work fine, so my connection should be OK.

This is with htsjdk 2.8.1.

Can you reproduce this problem? Any ideas what's wrong?

Thank you, Dario

iromeo commented 6 years ago

Hi, any progress on this?

I'm sure this is an issue in the htsjdk FTPClient implementation, because the Apache FTPClient works fine against the same server. A simple scenario to reproduce it (in Kotlin):

import htsjdk.samtools.util.ftp.FTPClient
import org.apache.commons.net.ftp.FTPClientConfig
import java.net.URI

fun main(args: Array<String>) {
    val uri = URI.create("ftp://ftp.ebi.ac.uk/pub/databases/blueprint/blueprint_progenitor_methylomes/rna/RNA_D1_CLP_100.bw")

    try {
        println("Apache FTP")
        appacheFtp(uri)
    } catch (e: Exception) {
        e.printStackTrace()
    }
    println()
    try {
        println("HTSJDK FTP")
        htsjdkFtp(uri)
    } catch (e: Exception) {
        e.printStackTrace()
    }
}

fun htsjdkFtp(uri: URI) {
    val ftp = FTPClient()

    println("Connecting..")
    ftp.connect(uri.host)

    println("Login...")
    ftp.login("anonymous", "")

    println("Reading ${uri.path}...")
    val reply = ftp.retr(uri.path)
    println(reply.isPositiveCompletion)
    println("Done")
    ftp.disconnect()
}

fun appacheFtp(uri: URI) {
    val ftp = org.apache.commons.net.ftp.FTPClient()
    val config = FTPClientConfig()
    ftp.configure(config)

    println("Connecting..")
    ftp.connect(uri.host)
    println("[Connected] ${ftp.replyString}")

    //ftp.enterLocalActiveMode();
    ftp.enterLocalPassiveMode();

    println("Login...")
    ftp.login("anonymous", "")

    println("Reading ${uri.path}...")
    ftp.retrieveFileStream(uri.path).use { it.read() }

    println("DONE")
    ftp.logout()
    ftp.disconnect()
}

Output

Apache FTP
Connecting..
[Connected] 220-   
220- ftp1.ebi.ac.uk FTP server
220-   
220- WARNING: please note that the private part of this ftp service 
220- has been migrated to ftp-private.ebi.ac.uk on 3rd June 2010.
220 

Login...
Reading /pub/databases/blueprint/blueprint_progenitor_methylomes/rna/RNA_D1_CLP_100.bw...
DONE

HTSJDK FTP
Connecting..
Login...
[hangs here for several minutes]
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at htsjdk.samtools.util.ftp.FTPReply.<init>(FTPReply.java:37)
    at htsjdk.samtools.util.ftp.FTPClient.executeCommand(FTPClient.java:82)
    at htsjdk.samtools.util.ftp.FTPClient.login(FTPClient.java:91)
    at org.jetbrains.bio.util.FooKt.htsjdkFtp(Foo.kt:32)
    at org.jetbrains.bio.util.FooKt.main(Foo.kt:19)

iromeo commented 6 years ago

P.S. I tried rewriting SeekableFTPStreamHelper on top of the Apache FTP client, and everything started working.
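
For anyone comparing the two clients: both stack traces block in a readLine() call inside FTPReply while waiting for a server reply, and the greeting in the Apache output is a multi-line one ("220-" lines closed by a "220 " line). Nothing in this thread confirms that reply parsing is the cause of the hang. The block below is only a minimal sketch of how RFC 959 expects a client to consume one reply, so it can be checked against what the server actually sends. FtpReplySketch, Reply and readReply are hypothetical names, not part of htsjdk or Apache Commons Net, and the "331" line in main is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not htsjdk or Apache Commons Net code. */
final class FtpReplySketch {

    /** One complete FTP reply: the 3-digit code plus every line that belonged to it. */
    static final class Reply {
        final int code;
        final List<String> lines;

        Reply(int code, List<String> lines) {
            this.code = code;
            this.lines = lines;
        }
    }

    /**
     * Reads exactly one reply. Per RFC 959, a first line of the form "ddd-..."
     * opens a multi-line reply, which ends at the first line that starts with
     * the same code followed by a space.
     */
    static Reply readReply(BufferedReader in) throws IOException {
        String first = in.readLine();
        if (first == null) {
            throw new IOException("Connection closed before a reply arrived");
        }
        if (!startsWithCode(first)) {
            throw new IOException("Malformed reply line: " + first);
        }
        String codeText = first.substring(0, 3);
        List<String> lines = new ArrayList<>();
        lines.add(first);
        if (first.length() > 3 && first.charAt(3) == '-') {
            String line;
            do {
                line = in.readLine();
                if (line == null) {
                    throw new IOException("Connection closed inside a multi-line reply");
                }
                lines.add(line);
            } while (!isLastLine(line, codeText));
        }
        return new Reply(Integer.parseInt(codeText), lines);
    }

    private static boolean startsWithCode(String s) {
        return s.length() >= 3
                && Character.isDigit(s.charAt(0))
                && Character.isDigit(s.charAt(1))
                && Character.isDigit(s.charAt(2));
    }

    /** Lenient: also accepts a bare "ddd" with no trailing space as the closing line. */
    private static boolean isLastLine(String line, String codeText) {
        return line.equals(codeText) || line.startsWith(codeText + " ");
    }

    public static void main(String[] args) throws IOException {
        // Greeting copied from the Apache output in this thread; the 331 line is invented.
        String wire = "220-   \r\n"
                + "220- ftp1.ebi.ac.uk FTP server\r\n"
                + "220-   \r\n"
                + "220- WARNING: please note that the private part of this ftp service \r\n"
                + "220- has been migrated to ftp-private.ebi.ac.uk on 3rd June 2010.\r\n"
                + "220 \r\n"
                + "331 Password required\r\n";
        BufferedReader in = new BufferedReader(new StringReader(wire));
        Reply greeting = readReply(in);
        System.out.println(greeting.code + " (" + greeting.lines.size() + " lines)"); // prints 220 (6 lines)
        System.out.println(readReply(in).code); // prints 331
    }
}
```

If a client sends USER before the whole greeting has been consumed, its next read can pick up leftover greeting lines instead of the reply it is waiting for. Logging each line as it is read is a cheap way to check whether that is happening here.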

magicDGS commented 6 years ago

I recommend using a FileSystem provider for FTP instead, and accessing paths on FTP servers through java.nio. One that is on my radar for FTP support in my own toolkit is https://robtimus.github.io/ftp-fs/ (still untested with HTSJDK). I also think that in a future htsjdk3 the code that handles this kind of IO will disappear in favor of java.nio.Path, which is in line with my suggestion of using ftp-fs.

I hope that this helps!
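
A minimal sketch of what the java.nio route could look like, using only JDK classes. It assumes that an FTP provider on the classpath registers the "ftp" URI scheme and that its byte channels support position(); neither has been verified against ftp-fs here. NioRangeReadSketch, hasProvider and readRange are hypothetical names, and main exercises the code on a local temporary file rather than an FTP server.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.spi.FileSystemProvider;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of htsjdk or ftp-fs. */
final class NioRangeReadSketch {

    /** True if an installed FileSystemProvider claims the given URI scheme, e.g. "ftp". */
    static boolean hasProvider(String scheme) {
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if (provider.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Reads up to length bytes starting at offset. Works for any Path whose
     * provider returns a channel that supports position(); whether a given
     * FTP provider does is an assumption to check.
     */
    static byte[] readRange(Path path, long offset, int length) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path)) {
            channel.position(offset);
            ByteBuffer buffer = ByteBuffer.allocate(length);
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) {
                    break; // end of file reached before the buffer was full
                }
            }
            return Arrays.copyOf(buffer.array(), buffer.position());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("range", ".bin");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            byte[] slice = readRange(tmp, 3, 4);
            System.out.println(new String(slice, StandardCharsets.US_ASCII)); // prints 3456
            // The result of this check depends on what is on the classpath.
            System.out.println("ftp provider installed: " + hasProvider("ftp"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With a suitable provider installed, the same readRange call should accept a Path obtained from an ftp:// URI via Paths.get(URI). Without one, Paths.get throws FileSystemNotFoundException, which is why hasProvider is worth checking first.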