Open marchoeppner opened 1 year ago
Looks interesting. Sounds easy enough to do. I'd like to see that in nf-core/modules first before adding it to sarek. But we will definitely have a look.
I see the problem with the ORA reference genome: you'll have to know exactly which one was used and have access to it. (This is basically the same issue as with CRAMs, except that with those we supply the reference genome, so in the context of the pipeline, that's not an issue.) Presumably Illumina uses specific versions for the common model organisms and provides a source to download them from. Either we need code in the pipeline that handles the download, or we add a parameter and make the user do it, or we even add it to iGenomes.
Could a first step perhaps just be to get orad available for conda/mamba, singularity and docker?
Also - does anyone here know where I might find some ORA-compressed fastq files, along with a link to the ORA reference genome that was used for the compression? (Unfortunately, I don't seem to have access to Dragen at the moment.)
Description of feature
Hi,
Illumina has introduced a new read compression format, ORA: https://www.illumina.com/science/genomics-research/articles/design-ora-lossless-genomic-compression.html
ORA compresses human read data by 80% compared to traditional fastq.gz - I suspect it will become a commonly used option for data rolling off the upcoming NovaSeq X and NextSeq 1500 instruments (on-board support for ORA compression).
ORA is lossless and can be converted, or better yet streamed, into fastq.gz - this requires the ORA reference and a small command-line utility (orad) - see: https://emea.support.illumina.com/sequencing/sequencing_software/DRAGENORA.html
For example, to stream ORA-compressed paired-end read data to bwa, you could do:
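A rough sketch of such a call, wrapped in a small bash function. It assumes that `orad` accepts `-c` to write to stdout and `--raw` to emit uncompressed FASTQ, and that `orad` can locate its ORA reference on its own; please check `orad --help` for your version. The function name and all file names are hypothetical.

```shell
# Stream a pair of ORA-compressed FASTQ files into bwa mem without writing
# decompressed intermediates to disk. Needs bash for <(...) process substitution.
# Assumed orad options: -c (write to stdout), --raw (plain, non-gzipped FASTQ).
ora_to_sam() {
    local ref=$1 r1=$2 r2=$3 out=$4
    bwa mem "$ref" \
        <(orad -c --raw "$r1") \
        <(orad -c --raw "$r2") \
        > "$out"
}
```

Something like `ora_to_sam humanref.fasta sample_R1.fastq.ora sample_R2.fastq.ora sample.sam` would then run the alignment, given a bwa-indexed `humanref.fasta`.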
Would be nice to see support for this make it into Sarek.