nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Support for Illumina ORA format #907

Open marchoeppner opened 1 year ago

marchoeppner commented 1 year ago

Description of feature

Hi,

Illumina has introduced a new read compression format, ORA: https://www.illumina.com/science/genomics-research/articles/design-ora-lossless-genomic-compression.html

According to Illumina, ORA shrinks human read data by up to 80% compared to traditional fastq.gz. I suspect it will become a commonly used option for data rolling off the upcoming NovaSeq X and NextSeq 1500 instruments, which support ORA compression on board.

ORA is lossless and can be converted, or better yet streamed, into fastq.gz. This requires a reference and a small command-line utility (orad); see: https://emea.support.illumina.com/sequencing/sequencing_software/DRAGENORA.html

For example, to stream ORA-compressed paired-end read data to bwa, you could do:

bwa mem humanref.fasta <(orad file.fastq.ora -c --raw --ora-reference /path/to/ora-reference ) > resu.sam
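
If the mates arrive as two separate .ora files, one way to extend the command above is to give each file its own process substitution. The following is only a sketch: it assumes the same orad flags work per file, that the command is later run in bash, and that paths contain no spaces. The helper names and the R1/R2 file names are hypothetical. The function prints the command rather than running it, so it can be inspected first.

```shell
# Sketch only: build (and print) a bwa mem command that streams two
# ORA-compressed mate files through orad. Helper and file names are
# hypothetical; the orad flags are the ones from the example above.
ora_stream() {
    # $1 = ORA reference path, $2 = one .fastq.ora file
    printf '<(orad %s -c --raw --ora-reference %s)' "$2" "$1"
}

ora_bwa_cmd() {
    # $1 = bwa-indexed FASTA, $2 = ORA reference path, $3 = R1, $4 = R2
    printf 'bwa mem %s %s %s\n' \
        "$1" "$(ora_stream "$2" "$3")" "$(ora_stream "$2" "$4")"
}

# Print the command for a hypothetical pair of mate files.
ora_bwa_cmd humanref.fasta /path/to/ora-reference R1.fastq.ora R2.fastq.ora
```

The printed line can be pasted into a bash session with its output redirected to a SAM file, as in the single-file example.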

Would be nice to see support for this make it into Sarek.

maxulysse commented 1 year ago

Looks interesting. Sounds easy enough to do. I'd like to see that in nf-core/modules first before adding it to sarek. But we will definitely have a look.

tdanhorn commented 6 months ago

I see the problem with the ORA reference genome: you'll have to know exactly which one was used and have access to it. (This is basically the same issue as with CRAMs, except that with those we supply the reference genome, so in the context of the pipeline, that's not an issue.) Presumably Illumina uses specific versions for the common model organisms and provides a source to download them from. Either we need code in the pipeline that handles the download, or we add a parameter and leave the download to the user, or we even add the references to iGenomes.
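
To make the "use a parameter" option concrete, here is a rough sketch of an early check that refuses .ora inputs unless the user has supplied an ORA reference path that exists. The function name and error messages are hypothetical, and this is plain shell for illustration rather than anything taken from Sarek.

```shell
# Sketch only: fail early when .ora inputs are given without a usable
# ORA reference. Function name and messages are hypothetical.
check_ora_inputs() {
    # $1 = user-supplied ORA reference path (may be empty)
    # remaining arguments = input read files
    ora_ref=$1
    shift
    for reads in "$@"; do
        case $reads in
            *.ora)
                if [ -z "$ora_ref" ]; then
                    echo "ERROR: $reads is ORA-compressed but no ORA reference was given" >&2
                    return 1
                fi
                if [ ! -e "$ora_ref" ]; then
                    echo "ERROR: ORA reference not found: $ora_ref" >&2
                    return 1
                fi
                ;;
        esac
    done
    return 0
}
```

Inputs that do not end in .ora pass through untouched, so existing fastq.gz workflows would not need the parameter at all.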

asp8200 commented 5 months ago

Could a first step perhaps just be to make orad available via conda/mamba, Singularity and Docker?

Also, does anyone here know where I might find some ORA-compressed FASTQ files, along with a link to the ORA reference genome that was used for the compression? (Unfortunately, I don't seem to have access to DRAGEN at the moment.)