Closed EddieLF closed 3 months ago
Also @MattWellie FYI - I'm not sure if you've used the CRAMs that came out of the BamToCram stage yet for any other workflows. If so, we might want to run those workflows again and regenerate their results once this is merged and we've re-converted the bams with the correct reference.
oof. Sorry, that is nasty. We should have caught that before it was merged in the RNA work. Good work tracking this down 🙌
For long read bits I've not touched the CRAMs, the workflow hasn't run in anger yet, and starts with the VCFs
The bam_to_cram job was using the `STAR` reference fasta by default, mainly for doing work with RNA seq data.

This PR adds a new argument to the bam_to_cram job, called `reference_fasta_path`. This now determines the fasta reference used by the samtools job.

When the bam_to_cram job is invoked by the `seqr_loader_long_read` stage `BamToCram`, we can set our reference path to the default hg38 masked reference from the Broad, the same one used in the Alignment pipeline.

I've also updated the only other place the bam_to_cram job was used in cpg_workflows, which is in `align_rna.py`. I set the reference path to be the original `STAR` hg38 fasta for this job, so it will function the same after this change.

@cassimons this should resolve the IGV.js issue regarding viewing the long read CRAMs.
When we converted from BAM to CRAM with samtools, we used the STAR fasta from the rna seq workflows as the reference: https://batch.hail.populationgenomics.org.au/batches/453334/jobs/3
I reran the BamToCram stage in testing with the standard masked hg38 reference instead: https://batch.hail.populationgenomics.org.au/batches/453334/jobs/4
It ran 3x faster and the new CRAM has no errors in IGV.js 🚀
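If it helps when auditing existing CRAMs, one possible way to see which reference a file points at is to read the `@SQ` lines of its header (for example from `samtools view -H`). The sketch below is an assumption on my part, not something this PR adds: the SAM spec defines an optional `UR` tag on `@SQ` lines, but whether it is present depends on how the file was written. The function name and the example header text are made up for illustration.

```python
def sq_reference_uris(header_text: str) -> set[str]:
    """Collect the UR (reference URI) values from @SQ header lines.

    Hypothetical helper; returns an empty set if no @SQ line carries a UR tag.
    """
    uris = set()
    for line in header_text.splitlines():
        if not line.startswith('@SQ'):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split('\t')[1:]:
            if field.startswith('UR:'):
                uris.add(field[3:])
    return uris


# Made-up header text, only to show the expected shape of the input.
example_header = (
    '@HD\tVN:1.6\tSO:coordinate\n'
    '@SQ\tSN:chr1\tLN:248956422\tUR:/refs/hg38_masked.fasta\n'
    '@SQ\tSN:chr2\tLN:242193529\tUR:/refs/hg38_masked.fasta\n'
)
print(sq_reference_uris(example_header))
```

More than one distinct URI, or a URI pointing at the STAR fasta, would flag a CRAM that should be re-converted.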