Foreign file staging can allow pipelines to proceed with corrupt files

scwatts commented 1 month ago

Bug report

Expected behavior and actual behavior

Description and background

foreign file staging via HTTP/S can report successful transfer following connection interruption
under certain conditions this allows pipelines to continue with incomplete and corrupt inputs
this scenario has been observed intermittently for nf-core/oncoanalyser when retrieving reference data from Cloudflare R2
- most recently reported by a user in the nf-core Slack Workspace
- this issue has also been reported directly to me by collaborators
- I've also personally encountered this error with regular but low frequency over the past 8 months

Desired behaviour

minimally to emit an error message and halt pipeline execution, or
ideally retry / resume download with required byte range to complete staging

Steps to reproduce the problem

Given the difficultly involved in reproducing this intermittent problem I've put together toy example that can replicate behaviour locally with small data. The simple example is not a perfect replication but fundamentally demonstrates the problem described above.

File: run_simple_http_server.py (click to show)

```python #!/usr/bin/env python3 import argparse import http.server import time TAR_DATA = b'\x1f\x8b\x08\x00\xb7\xa3\xb5f\x00\x03\xed\xd1M\n\xc20\x10\x05\xe0\xac=ENP\'\xe9\xfc\x9c\xa7BB\x17\xc5H\x1b\xc1\xe3\x1bm)"\xb4\xbb(\xe2|\x9b\xb7\x19\x92I^L\xe9h*\x03\x00!\xb2\xcf\xe49\xc1\xe3\x9c\x0b\xeb\x90\x89Z\xef\x98\xc5\x82\xf3(\xceX\xaa\xbd\xd8\xc3u\xca\xddXV\x99r\xb8\xf4\xe1\xbc9W\xc6b\xdc9gy\xc7\x9a?"\x96\xfeO\xdd\xd8\xe4[\xaevG\xf9\x0fF\xdc\xeb\x9f\xdf\xfaoE\xbc\xb1Pm\xa3\x17\x7f\xde\x7f\x1f\x86!\x1d\xbe\xbd\x85RJ\xa9O\xbb\x03\xe7+\x86\x11\x00\n\x00\x00' class CustomRequestHandler(http.server.SimpleHTTPRequestHandler): protocol_version = 'HTTP/1.1' simulate_failure = False def do_GET(self): # Send headers including expected content length self.send_response(200) self.send_header('Content-Length', len(TAR_DATA)) self.end_headers() # Write initial chunk then create artificial pause for NF to display staging message self.wfile.write(TAR_DATA[:16]) time.sleep(5) # Write remaining data except final several bytes so that we can interrupt if requested self.wfile.write(TAR_DATA[16:-16]) # Simulate server-side interruption that should raise an exception in client if self.simulate_failure: raise Exception # Complete writing and flush output self.wfile.write(TAR_DATA[-16:]) self.wfile.flush() def get_arguments(): parser = argparse.ArgumentParser() parser.add_argument('--simulate_serverside_failure', action='store_true') return parser.parse_args() def main(): # Get commandline args and configure args = get_arguments() CustomRequestHandler.simulate_failure = args.simulate_serverside_failure # Launch HTTP server with http.server.ThreadingHTTPServer(('localhost', 8000), CustomRequestHandler) as server: server.serve_forever() if __name__ == '__main__': main() ```

File: main.nf (click to show)

```nextflow #!/usr/bin/env nextflow process DECOMPRESS_TARBALL { publishDir 'output/' input: path tarball output: path 'foo/' script: """ tar -zxvf ${tarball} """ } workflow { DECOMPRESS_TARBALL('http://localhost:8000/foo.tar.gz') } ```

After writing above scripts to disk, problem behaviour can be replicated with the following:

# First shell
python3 run_simple_http_server.py --simulate_serverside_failure

# Second shell
nextflow run main.nf

Remove --simulate_serverside_failure to show normal operation. Tested with Nextflow 24.04.4 and 24.07.0-edge.5923 on macOS.

Program output

Console output when simulating the above serverside error (i.e. staging fails but pipeline proceeds with corrupt input)


 N E X T F L O W   ~  version 24.04.4

Launching `main.nf` [happy_bhaskara] DSL2 - revision: fe3bec1c93

executor >  local (1)
[4a/5386b5] process > DECOMPRESS_TARBALL [100%] 1 of 1, failed: 1 ✘
Staging foreign file: http://localhost:8000/foo.tar.gz
ERROR ~ Error executing process > 'DECOMPRESS_TARBALL'

Caused by:
  Process `DECOMPRESS_TARBALL` terminated with an error exit status (1)

Command executed:

  tar -zxvf foo.tar.gz

Command exit status:
  1

Command output:
  (empty)

Command error:
  tar: Error opening archive: truncated gzip input

Work dir:
  /Users/stephen/projects/nextflow_simulate_connection_close/work/4a/5386b592a8d2e8cde461e2bf877fa5

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Attachments: nextflow.log

Environment

Nextflow version: 24.04.4 (additionally confirmed: 24.07.0-edge.5923; additionally observed: other older 24.XX.X versions)
Java version: 17.0.3 2022-04-19 LTS
Operating system: macOS (additionally observed: Linux)
Bash version: 5.1.16(1)-release

Additional context

None

ewels commented 3 weeks ago

Great minimal example, thanks for this!

jfy133 commented 3 weeks ago

I also don't have explicit examples of these either due to inconsistent behaviour, but I've seen similar things occasionally with nf-core/mag that has very large databases (77-200 GB tar.gz) and/or on distant servers (Australia). I assume it's something to do with the connection dropping/timing out at points, but occasionally the pipeline try to continue to despite the incomplete download.

ewels commented 3 weeks ago

@scwatts - thinking a bit more about this. Do you have any ideas about how we can detect that a download is incomplete?

The ones that spring to mind for me are the Content-length header (can't rely on this being present) or a checksum (typically not present). I'm not sure that HTTP connections really have many other safeguards in place for noticing an incomplete download?

scwatts commented 3 weeks ago

I think we are limited here to making use of the Content-Length or Transfer-Encoding header fields.

Just reading through RFC 9110 and I see that the spec stipulates a server should always send the Content-Length header field where size is known ahead of time. Though as you point out server implementations are not necessarily required to do so.

However, since we will often have Content-Length available I think it should be made use of when possible to improve reliability of foreign file staging via HTTP/S even if not all corner cases are covered.

If going this route it would also be feasible to take Content-Length for detection of interrupted downloads and extend to transfer restart / resume using range specifiers where the server shows support through Accept-Ranges.

pditommaso commented 2 weeks ago

The main problem here is that the data is copied via standard Java API that rely on InputStream data structure that by definition has no visibility to the overall length.

However, it should be possible to wrap the stream into a filter and fetch the content length as suggested above. To be investigated

https://github.com/nextflow-io/nextflow/blob/5a37e6177f7a0e02b2af922768a0df5984b07b7b/modules/nf-httpfs/src/main/nextflow/file/http/XFileSystemProvider.groovy#L185-L218

scwatts commented 2 weeks ago

Closing now that this has been addressed in the above PR, thank you all!

nextflow-io / nextflow