mobie / mobie-io

BSD 2-Clause "Simplified" License

Performance considerations when reading through http #109

Open tischi opened 2 years ago

tischi commented 2 years ago

Currently we are using this code from java.net:

    /**
     * Opens a connection to this {@code URL} and returns an
     * {@code InputStream} for reading from that connection. This
     * method is a shorthand for:
     * <blockquote><pre>
     *     openConnection().getInputStream()
     * </pre></blockquote>
     *
     * @return     an input stream for reading from the URL connection.
     * @exception  IOException  if an I/O exception occurs.
     * @see        java.net.URL#openConnection()
     * @see        java.net.URLConnection#getInputStream()
     */
    public final InputStream openStream() throws java.io.IOException {
        return openConnection().getInputStream();
    }

I wonder what that actually does? Specifically, does the InputStream (a) already contain all the downloaded data or (b) not?

tischi commented 2 years ago

Here is something to read: https://www.baeldung.com/java-download-file

tischi commented 2 years ago

@axtimwalde do you know the most performant way to completely load a text file from a URL into memory?

axtimwalde commented 2 years ago

The InputStream does not yet contain all the downloaded data but can deliver it on request. I haven't done a performance evaluation. I believe the most significant difference between the various approaches is whether you have to load the entire file or only some parts of it via random access. This article is pretty comprehensive and includes loading from URLs: https://www.baeldung.com/reading-file-in-java
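To make "deliver it on request" concrete, here is a minimal sketch of loading a text resource completely into memory by draining the stream until it reports end-of-stream. The class and method names (`UrlText`, `readFully`, `readText`) are hypothetical and chosen for illustration only, and UTF-8 is an assumption about the file encoding. On Java 9 and later, `InputStream.readAllBytes()` could replace the manual loop.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of mobie-io or n5.
class UrlText {

    /** Reads everything the stream delivers until end-of-stream. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // Each read() call may block until more data has arrived.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Opens the URL, drains the stream, and decodes the bytes as UTF-8 (assumed encoding). */
    static String readText(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(readFully(in), StandardCharsets.UTF_8);
        }
    }
}
```

Calling `UrlText.readText(url)` returns the whole content as one `String`; the stream is closed by the try-with-resources block even if reading fails.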

tischi commented 2 years ago

> The InputStream does not yet contain all the downloaded data but can deliver it on request

@axtimwalde This is interesting, because I think HTTP requests can have a significant overhead independent of the amount of data transferred.

For example, here in your code: https://github.com/saalfeldlab/n5-google-cloud/blob/master/src/main/java/org/janelia/saalfeldlab/n5/googlecloud/N5GoogleCloudStorageReader.java#L206

I would be worried that this code currently entails two HTTP requests (one in line 206 and another one in line 207), just for reading a small text file. Downloading all the information in one go (if possible) might be more performant. What do you think?
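One way to check how many requests a given read path issues is to point it at a local server that counts them. The sketch below does this for plain `java.net.URL#openStream()` only, using the JDK's built-in `com.sun.net.httpserver.HttpServer`; it says nothing about the Google Cloud Storage client used in the linked code, which would need its own measurement. The class name `RequestCountDemo`, the path `/small.txt`, and the response body are placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class for counting HTTP requests of one read.
class RequestCountDemo {

    /** Serves a small body locally, reads it once via openStream(), returns the requests seen. */
    static int countRequestsForOneRead() throws IOException {
        AtomicInteger requests = new AtomicInteger();
        byte[] body = "small text file\n".getBytes(StandardCharsets.UTF_8);

        // Port 0 lets the operating system pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/small.txt", exchange -> {
            requests.incrementAndGet();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/small.txt");
            try (InputStream in = url.openStream()) {
                byte[] buffer = new byte[1024];
                while (in.read(buffer) != -1) {
                    // Drain the body; the content itself is not needed here.
                }
            }
        } finally {
            server.stop(0);
        }
        return requests.get();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("HTTP requests seen by server: " + countRequestsForOneRead());
    }
}
```

With plain `java.net.URL`, opening the stream and reading from it share a single request, so the counter is expected to report one. The same counting approach could be applied to other clients by swapping out the reading code inside the `try` block.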

tischi commented 2 years ago

I could not find a method that does it "in one go". There always seems to be a first step of opening the InputStream. I tried to benchmark this by reading a not-so-small file:

        import java.io.InputStream;
        import java.net.URL;
        import java.nio.charset.StandardCharsets;
        import org.apache.commons.io.IOUtils; // Apache Commons IO

        long start;

        final String tableURL = "https://raw.githubusercontent.com/mobie/platybrowser-project/main/data/1.0.1/tables/sbem-6dpf-1-whole-segmented-cells/default.tsv";

        // Step 1: open the connection and obtain the stream
        start = System.currentTimeMillis();
        URL url = new URL(tableURL);
        final InputStream inputStream = url.openStream();
        System.out.println("Open Table InputStream [ms]: " + ( System.currentTimeMillis() - start ));

        // Step 2: read the whole stream into a String
        start = System.currentTimeMillis();
        final String s = IOUtils.toString(inputStream, StandardCharsets.UTF_8.name());
        System.out.println("Read InputStream into String [ms]: " + ( System.currentTimeMillis() - start ));
        inputStream.close();

and I am getting:

Open Table InputStream [ms]: 766
Read InputStream into String [ms]: 2703
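A single cold run like this mixes several costs. For `java.net.URL` over HTTP(S), the time until the stream is returned typically includes connection setup and waiting for the response headers, while the read time covers transferring the body. The sketch below separates those phases and repeats the measurement, since later runs may reuse a kept-alive connection and so differ from the first. `PhaseTimer` and `Timings` are hypothetical names, and the numbers it prints depend entirely on network and server conditions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical timing helper; not a rigorous benchmark harness.
class PhaseTimer {

    /** Elapsed nanoseconds per phase plus the number of body bytes read. */
    static final class Timings {
        long connectNanos;
        long openStreamNanos;
        long readNanos;
        long bytesRead;
    }

    static Timings time(URL url) throws IOException {
        Timings t = new Timings();

        long t0 = System.nanoTime();
        URLConnection connection = url.openConnection();
        connection.connect(); // establishes the link to the resource
        long t1 = System.nanoTime();

        // For HTTP this sends the request and waits for the response headers.
        try (InputStream in = connection.getInputStream()) {
            long t2 = System.nanoTime();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                t.bytesRead += n;
            }
            long t3 = System.nanoTime();
            t.connectNanos = t1 - t0;
            t.openStreamNanos = t2 - t1;
            t.readNanos = t3 - t2;
        }
        return t;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the table URL used in this thread if no argument is given.
        String address = args.length > 0 ? args[0]
                : "https://raw.githubusercontent.com/mobie/platybrowser-project/main/data/1.0.1/tables/sbem-6dpf-1-whole-segmented-cells/default.tsv";
        URL url = new URL(address);
        for (int run = 0; run < 3; run++) {
            Timings t = time(url);
            System.out.printf("run %d: connect %d ms, open stream %d ms, read %d ms (%d bytes)%n",
                    run, t.connectNanos / 1_000_000, t.openStreamNanos / 1_000_000,
                    t.readNanos / 1_000_000, t.bytesRead);
        }
    }
}
```

If the read phase dominates across repeated runs, the cost is mostly data transfer rather than per-request overhead, and a different way of converting the stream to a `String` is unlikely to change much.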

More things to explore: https://stackoverflow.com/questions/309424/how-do-i-read-convert-an-inputstream-into-a-string-in-java