rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

does it work with the -Z option? #15

Closed: dtenenba closed this issue 6 years ago

dtenenba commented 6 years ago

Looks like parallel-fastq-dump writes out different parts of the output file simultaneously, so I assume it can't be used with fastq-dump's -Z option, which writes a single stream to stdout.

But thought I'd make sure. Is that the case?

I usually use this option to stream the fastq output directly to Amazon S3, since I have limited disk space available (these fastq files get big) but S3 provides effectively unlimited storage.

It looks like parallel-fastq-dump writes out N files, where N is the number of threads. Are these concatenated together with the equivalent of cat? If so, I could write to N files that are actually named pipes streaming to S3, and then concatenate them myself. Since I don't see an option to suppress the concatenation, I probably need to mess with the source... Any thoughts on this?
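Roughly what I have in mind is something like this (untested; it assumes the part filenames are predictable and that the dump writes into existing files rather than deleting and recreating them, and the bucket and filenames below are made up):

```python
import os
import subprocess
import tempfile

# untested sketch: one FIFO per expected output file, each drained by an
# uploader that streams it to its own S3 object. the bucket and filenames
# are made up; the real names depend on how the dump is run.
bucket = "s3://my-bucket/sra"
parts = ["SRR000001_1.part0.fastq", "SRR000001_1.part1.fastq"]

workdir = tempfile.mkdtemp()
uploaders = []
for name in parts:
    fifo = os.path.join(workdir, name)
    os.mkfifo(fifo)
    # `aws s3 cp -` streams stdin to S3; the shell redirection attaches the FIFO.
    cmd = f"aws s3 cp - {bucket}/{name} < {fifo}"
    uploaders.append(subprocess.Popen(["bash", "-c", cmd]))

# ... run the dump here with its output directory set to `workdir`, so it
# writes into the FIFOs; this only works if the files are opened in place
# rather than deleted and recreated ...

for p in uploaders:
    p.wait()
```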

Thanks.

rvalieris commented 6 years ago

hello, as it currently is, the -Z option will be passed directly to fastq-dump, and you will get multiple processes writing to stdout at the same time, which is probably not a good idea.

to interact with S3 you could try FUSE solutions such as s3fs or goofys: mount an S3 bucket as a directory and point parallel-fastq-dump at it as a temporary directory and/or output directory. these mounts are usually a bad idea for random access, but work fine for serial reading/writing.
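for example, something along these lines (untested; the bucket, mountpoint and accession are made up, and you should double-check the exact flag names against --help on your versions):

```python
import os
import subprocess

# untested sketch: mount a bucket with goofys and point parallel-fastq-dump's
# output (and/or temp) directory at the mount. the bucket, mountpoint and
# accession are made up.
mountpoint = "/mnt/fastq-bucket"
os.makedirs(mountpoint, exist_ok=True)
subprocess.run(["goofys", "my-bucket", mountpoint], check=True)

subprocess.run([
    "parallel-fastq-dump",
    "--sra-id", "SRR2244401",      # example accession
    "--threads", "4",
    "--split-files", "--gzip",     # extra options are passed through to fastq-dump
    "--outdir", mountpoint,        # final fastq.gz files land on S3
    "--tmpdir", mountpoint,        # assumed flag name for the temp dir; check --help
], check=True)
```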

the number of files varies with the SRR and the cli options: you could have 2 files per thread (paired-end with read splitting), or even 3 (paired-end + single-end reads in a single SRR).

the concatenation is not optional, but if you do the fastq-dumping by hand, cat'ing the files (in the correct order) will work. both gzip and bzip2 handle concatenated files fine.
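for reference, a rough sketch of the by-hand version (untested; the accession, bucket and spot count are made up, it assumes a single-end run so there is only one stream, and fastq-dump's -N/-X select the spot range of each chunk; get the real spot count from sra-stat first):

```python
import subprocess

# untested sketch, single-end case: dump fixed spot ranges in parallel
# (each fastq-dump -Z writes to its own redirected stdout, so the streams
# don't mix), then cat the chunks in order into one stream uploaded to S3.
srr = "SRR000001"                 # made-up accession
bucket = "s3://my-bucket/sra"     # made-up bucket
total_spots = 4_000_000           # look this up first, e.g. with sra-stat
threads = 4

step = -(-total_spots // threads)  # ceiling division
jobs, chunks = [], []
for i in range(threads):
    start = i * step + 1
    end = min((i + 1) * step, total_spots)
    chunk = f"{srr}.chunk{i}.fastq"
    chunks.append(chunk)
    with open(chunk, "wb") as out:
        jobs.append(subprocess.Popen(
            ["fastq-dump", "-N", str(start), "-X", str(end), "-Z", srr],
            stdout=out))
for p in jobs:
    p.wait()

# concatenate the chunks in the correct order and stream to a single S3 object.
uploader = subprocess.Popen(
    ["aws", "s3", "cp", "-", f"{bucket}/{srr}.fastq"],
    stdin=subprocess.PIPE)
for chunk in chunks:
    with open(chunk, "rb") as f:
        while True:
            buf = f.read(1 << 20)
            if not buf:
                break
            uploader.stdin.write(buf)
uploader.stdin.close()
uploader.wait()
```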