Allow for other reference genomes in bamtocram job

EddieLF commented 3 months ago

The bam_to_cram job was using the STAR reference fasta by default - mainly for doing work with RNA seq data.

This PR adds a new argument to the bam_to_cram job, called reference_fasta_path. This now determines the reference fasta used by the samtools job.

When the bam_to_cram job is invoked by the seqr_loader_long_read stage BamToCram, we can set our reference path to the default hg38 masked reference from the Broad, the same one used in the Alignment pipeline.

I've also updated the only other place the bam_to_cram job was used in cpg_workflows, which is in align_rna.py. I set the reference path to be the original STAR hg38 fasta for this job, so it will function the same after this change.
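For anyone skimming, here is a rough sketch of the idea, not the actual cpg_workflows code. The function name, its signature, the thread count and both reference paths are placeholders I made up for illustration. The only real parts are the samtools flags (-C for CRAM output, -T for the reference fasta, -@ for threads).

```python
import shlex


def bam_to_cram_command(
    bam_path: str,
    cram_path: str,
    reference_fasta_path: str,
    threads: int = 4,
) -> str:
    """Build a shell command that converts a BAM to an indexed CRAM.

    Hypothetical helper: the reference is a required argument, so no
    caller silently inherits a default fasta.
    """
    if not reference_fasta_path:
        raise ValueError('reference_fasta_path must be set explicitly')
    ref, bam, cram = (
        shlex.quote(p) for p in (reference_fasta_path, bam_path, cram_path)
    )
    # -C writes CRAM; -T names the reference fasta used for encoding.
    convert = f'samtools view -@ {threads} -T {ref} -C -o {cram} {bam}'
    index = f'samtools index -@ {threads} {cram}'
    return f'{convert} && {index}'


# Placeholder paths, not the real reference locations.
LONG_READ_REFERENCE = 'reference/hg38_masked.fasta'
RNA_REFERENCE = 'reference/star_hg38.fasta'

# Each caller now chooses its own reference.
long_read_cmd = bam_to_cram_command('sample.bam', 'sample.cram', LONG_READ_REFERENCE)
rna_cmd = bam_to_cram_command('rna.bam', 'rna.cram', RNA_REFERENCE)
```

Making the reference a required argument means a new caller has to choose one on purpose, which is the kind of mistake this PR is fixing.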

@cassimons this should resolve the IGV.js issue regarding viewing the long read CRAMs.

When we converted from BAM to CRAM with samtools, we used the STAR fasta from the rna seq workflows as the reference: https://batch.hail.populationgenomics.org.au/batches/453334/jobs/3

I reran the BamToCram stage in testing with the standard masked hg38 reference instead: https://batch.hail.populationgenomics.org.au/batches/453334/jobs/4

It ran 3x faster and the new CRAM has no errors in IGV.js 🚀
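If it helps with checking other CRAMs, one option is to look at the @SQ lines in the header (for example from samtools view -H). CRAMs written by samtools usually carry M5 (sequence MD5) and UR (reference URI) tags there, though I'm assuming that holds for ours. A minimal sketch of pulling those tags out; the helper name and the example header values are made up.

```python
from typing import Dict


def reference_tags(sam_header: str) -> Dict[str, Dict[str, str]]:
    """Map each @SQ sequence name to its M5 and UR tags, where present."""
    tags: Dict[str, Dict[str, str]] = {}
    for line in sam_header.splitlines():
        if not line.startswith('@SQ'):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; split on the
        # first colon only, since UR values contain colons themselves.
        fields = dict(f.split(':', 1) for f in line.split('\t')[1:] if ':' in f)
        name = fields.get('SN', '')
        tags[name] = {k: fields[k] for k in ('M5', 'UR') if k in fields}
    return tags


# Made-up header text for illustration only.
example_header = (
    '@HD\tVN:1.6\tSO:coordinate\n'
    '@SQ\tSN:chr1\tLN:1000\tM5:0123abcd\tUR:file:///refs/example.fasta\n'
    '@SQ\tSN:chr2\tLN:2000\n'
)
example_tags = reference_tags(example_header)
```

Comparing the M5 values between two CRAMs, or against the fasta's own sequence dictionary, would show whether they were encoded against the same sequences.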

EddieLF commented 3 months ago

Also @MattWellie FYI - I'm not sure if you've used the CRAMs that came out of the BamToCram stage yet for any other workflows. If so, we might want to run those workflows again and regenerate their results once this is merged and we've re-converted the bams with the correct reference.

cassimons commented 3 months ago

oof. Sorry, that is nasty. We should have caught that before it was merged in the RNA work. Good work tracking this down 🙌

MattWellie commented 3 months ago

For long read bits I've not touched the CRAMs, the workflow hasn't run in anger yet, and starts with the VCFs

populationgenomics / production-pipelines

Allow for other reference genomes in bamtocram job #809