Docs request: Fetching remote files

trev-f commented 1 week ago

New feature (docs)

I would like to request documentation describing how remote files are downloaded/staged in Nextflow.

Usage scenario

Projects that require fetching large amounts of data from remote sources are common, and it's necessary to fetch those files in an efficient manner. While Nextflow makes it easy to download remote files, the lack of documentation on how remote files are handled makes it difficult to evaluate when to fetch files with this built-in Nextflow option versus building a more tailored solution.

Currently, the lack of documentation makes it difficult to build a mental model for how downloading remote files works in Nextflow. Since fetching remote data can be a massive bottleneck for some projects, it's imperative that users understand how Nextflow works so that we can build more efficient workflows.

Suggest implementation

In the remote files docs, answer some basic questions about how remote files are handled, such as:

What triggers the download of a remote file? When the file() method is called on a string that resembles a path? When a Channel is created from a Path object? When a Path object inside a Channel is accessed?
What is the default storage location for fetched files?
Are remote files cached, or are they fetched again for each run of a pipeline? Does -resume affect this behavior?
Is there a way to publish downloaded files to a specified location, or do pipeline developers need to write bespoke solutions using the methods defined for Path objects?
What Nextflow job is responsible for downloading files? Is this done in the main job?
How are remote files actually fetched? What packages/software/tools in Groovy are used to perform the download?

bentsherman commented 6 days ago

@trev-f To answer your immediate questions:

Remote file download is triggered when a task is created with an input file that does not reside on the same filesystem as the task work directory.
Remote files are staged into the work directory in a special subdirectory of the form stage-<hash>. We need to consider whether it's worth documenting the components of that hash.
Remote files are cached on a best-effort basis using the aforementioned hash. Of course, if the same remote file is requested by multiple tasks at the same time, they will likely each download a separate copy into separate folders. It would be good to document this caching behavior in more detail.
If you don't want to rely on the built-in remote file staging, you can write a custom process to download the file into a task directory. Make sure to provide the file name as a val input instead of a path input so that it isn't staged by Nextflow.
Nextflow itself downloads these files. As you can imagine, this doesn't always scale well, which is why we generally recommend using S3 for both the inputs and the work directory, together with Fusion, so that the tasks can stage the input files transparently from either location.
We try to use standard libraries as much as possible. For HTTP/FTP we use HttpURLConnection and FtpURLConnection, for S3 we use the AWS Java SDK, etc. You can look at the various implementations of FileSystemProvider in the Nextflow codebase for details.
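
For illustration, the custom download process mentioned above could look roughly like the sketch below. It is not taken from the Nextflow codebase or docs: the process name DOWNLOAD, the params.url default, the output file name, and the use of curl (which would have to be available in the task environment) are all assumptions made for the example. The key point is that the URL is declared as a val input, so Nextflow passes it through as a plain string and does not stage it.

```nextflow
// Hypothetical URL, used only for illustration; override with --url
params.url = 'https://example.com/data/sample.fastq.gz'

process DOWNLOAD {
    input:
    val url    // a plain value, not a path, so Nextflow does not stage the file

    output:
    path 'sample.fastq.gz'    // assumed output name; the file lands in the task directory

    script:
    """
    # Assumes curl is installed wherever the task runs
    curl -fsSL -o sample.fastq.gz '${url}'
    """
}

workflow {
    // Emit the URL as a string; calling file() on it is deliberately avoided
    DOWNLOAD(Channel.of(params.url))
}
```

With a script like this, the download runs inside the task rather than in Nextflow itself, and the fetched file is handled like any other task output.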

@christopher-hakkaart I think we can add a section under Working with files > Remote files, what do you think? You can try a first draft if you want, but I might need to do it myself because I need to check a few details in the code. In any case, this would be a great thing to document, as it is a mystery to many users and unfortunately doesn't rise to the level of something that just always magically works.

christopher-hakkaart commented 5 days ago

Hi both, I'll write a draft and link the issue for feedback/corrections.

bentsherman commented 5 days ago

Sounds good, once you have a first draft I can add some details as needed

nextflow-io / nextflow