marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 126 forks

Remote File Support for input files #761

Open geertvandeweyer opened 4 months ago

geertvandeweyer commented 4 months ago

I've added support for remote files (s3, gcs, ftp, http(s)) as input files.

marcelm commented 4 months ago

Thanks, this is interesting. I’ll have to think about whether I want this. I agree the code is not that intrusive, but it would require some documentation and may cause support requests.

No one has asked for this feature before. Would you actually benefit from it?

(I don’t have much time right now, please ping me next week if I haven’t gotten back to you by then.)

geertvandeweyer commented 4 months ago

Yes, I would benefit from this :-)

Cutadapt is the first step in our WES and WGS workflow on AWS. Staging hundreds of FASTQ files all at once when we start analyzing a NovaSeq run incurs a significant cost in EBS and EFS elastic throughput. With direct S3 access, the network traffic is spread out more evenly over time.

Similar efforts exist for htslib (samtools) and GATK (mainly for Google, though).

I'm happy to help with the documentation.

rhpvorderman commented 4 months ago

I propose using smart_open with ignore_extension and then passing the filehandle to xopen. Xopen does not support filehandles yet, but it should be possible, especially since the latest refactorings have almost halved the codebase, so there is room for some additional functionality again. This way there is no need to handle .xz extensions differently, and gzip files will be very efficiently decompressed.
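To illustrate the idea, here is a minimal stdlib-only sketch of the pattern: take an already-open binary stream whose extension has been ignored, and pick the decompressor from the magic bytes instead. The helper name `open_decompressed` is hypothetical and is not part of xopen or smart_open. An in-memory `io.BytesIO` stands in for the remote stream; with smart_open, the raw stream would presumably come from opening the URL in binary mode with extension-based decompression turned off.

```python
import bz2
import gzip
import io
import lzma

# Leading magic bytes of each supported compression format,
# mapped to the stdlib opener that accepts an open file object.
MAGIC_OPENERS = {
    b"\x1f\x8b": gzip.open,
    b"BZh": bz2.open,
    b"\xfd7zXZ\x00": lzma.open,
}


def open_decompressed(raw):
    """Return a binary reader that yields decompressed bytes from `raw`.

    `raw` is any readable binary file object (hypothetical stand-in for
    a remote stream). The format is detected from the first bytes, so
    the file name or extension is never consulted.
    """
    buffered = io.BufferedReader(raw)
    head = buffered.peek(6)[:6]  # look ahead without consuming data
    for magic, opener in MAGIC_OPENERS.items():
        if head.startswith(magic):
            return opener(buffered, "rb")
    return buffered  # no known magic bytes: treat as uncompressed


# Usage: a gzip-compressed FASTQ record served from memory.
record = b"@read1\nACGT\n+\nIIII\n"
stream = io.BytesIO(gzip.compress(record))
with open_decompressed(stream) as reader:
    data = reader.read()
```

Because detection relies on content rather than the file name, .gz, .bz2, and .xz inputs all go through the same code path.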

geertvandeweyer commented 4 months ago

> I propose using smart_open with ignore_extension and then passing the filehandle to xopen. Xopen does not support filehandles yet, but it should be possible, especially since the latest refactorings have almost halved the codebase, so there is room for some additional functionality again. This way there is no need to handle .xz extensions differently, and gzip files will be very efficiently decompressed.

That's a good suggestion. I'll try to adapt the code and post an update here.

geertvandeweyer commented 4 months ago

I've made some changes to xopen to support passing open filehandles.

It's this PR: https://github.com/pycompression/xopen/pull/150

Once that is in place, the current cutadapt PR can be re-evaluated. I've tested it, and S3-to-S3 processing over a decent network connection is about as fast as local-to-local processing with 4 threads: approximately 9M reads/minute for paired-end data, using:

cutadapt --transport-params '{"max_pool_connections":50 , "buffer_size":64008864}' -q 30 -a AGATCGGAAGAG --minimum-length 18 -e 0.1 -O 3 -n 1 -j 4 -o s3://gvdw-testing-bucket-dev/R1.fastq.gz -p s3://gvdw-testing-bucket-dev/R2.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R1_001.fastq.gz s3://wesss-263124-s--20231204-123244/wesss-263124-s_S71_L001_R2_001.fastq.gz
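For readers wondering how a JSON-valued option like `--transport-params` in the command above could be turned into a dictionary, here is a minimal sketch using argparse. It is an assumption about one possible approach, not the code from the PR; the parser name `demo` and the helper `parse_transport_params` are hypothetical.

```python
import argparse
import json


def build_parser():
    """Build a toy parser with a single JSON-valued option (illustrative only)."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--transport-params",
        type=json.loads,  # invalid JSON raises ValueError, which argparse reports as a usage error
        default={},
        help="JSON object with options for the remote transport layer",
    )
    return parser


def parse_transport_params(argv):
    """Return the decoded transport parameters from a list of CLI arguments."""
    return build_parser().parse_args(argv).transport_params


# Usage with the JSON value from the command above.
params = parse_transport_params(
    ["--transport-params", '{"max_pool_connections":50 , "buffer_size":64008864}']
)
```

The resulting dictionary could then be handed to whatever layer opens the remote files.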