Open jjfarrell opened 6 years ago
I went through the motions of passing .cram-loading through to hadoop-bam, but haven't tested it! You'd just call sc.loadReads like with a .bam.
IIUC, you'd specify relevant options like the reference path the same way you do in hadoop-bam, e.g. as a property on the Hadoop Configuration (i.e. SparkContext.hadoopConfiguration).
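A minimal, untested sketch of what setting that property might look like. The key below is an assumption: I believe it is the value of Hadoop-BAM's CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY, so check it against the Hadoop-BAM version you have. The object name, helper name, and reference path are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration

object CramReferenceConfig {
  // ASSUMPTION: the key Hadoop-BAM reads the CRAM reference from
  // (CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY); verify for your version.
  val ReferenceKey = "hadoopbam.cram.reference-source-path"

  // Records the reference FASTA location on the given Hadoop Configuration
  // and returns the same (mutated) Configuration for convenience.
  def withReference(conf: Configuration, referenceUri: String): Configuration = {
    conf.set(ReferenceKey, referenceUri)
    conf
  }
}
```

In a Spark application you would presumably call CramReferenceConfig.withReference(sc.hadoopConfiguration, "file:///path/to/reference.fa") before sc.loadReads, where the path is a placeholder.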
Making those properties proper method-params to be more idiomatic Scala would be nice.
Feel free to post the results of trying it!
Alternatively, your application code can decide to call hadoop-bam or spark-bam based on the file's extension 🙁
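A rough sketch of that extension-based dispatch, in case it helps. Everything here (ReadLoaderChoice, Backend, backendFor) is a hypothetical name for illustration, not part of spark-bam or hadoop-bam; it only decides which library to use and leaves the actual loading call to your application.

```scala
object ReadLoaderChoice {
  // Which library the application should hand the file to.
  sealed trait Backend
  case object SparkBam extends Backend
  case object HadoopBam extends Backend

  // Chooses a backend from the file extension (case-insensitive).
  // Returns None for anything that is not a .bam or .cram, e.g. a .crai index.
  def backendFor(path: String): Option[Backend] = {
    val name = path.toLowerCase
    if (name.endsWith(".bam")) Some(SparkBam)
    else if (name.endsWith(".cram")) Some(HadoopBam)
    else None
  }
}
```

The caller would then pattern-match on the result and invoke the corresponding library's loader.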
Hey @jjfarrell, I looked through the related posts https://github.com/broadinstitute/gatk/issues/4506 and https://github.com/HadoopGenomics/Hadoop-BAM/issues/196#issuecomment-373769717 and am curious to dig a little bit.
Is there a public .cram file you can point me at? I couldn't tell whether your adni/cram/ADNI_002_S_0413.hg38.realign.bqsr.cram is available anywhere.
@ryan-williams That cram is not available. However, I am working on getting a cram of a GIAB sample available for testing.
@ryan-williams
Here is a publicly available cram from 1000 Genomes. Again, I found the Spark GATK v4.0.2.1 job was quite slow processing this cram.
gatk FlagStatSpark --input 1000g/cram/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram --reference file:///restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -- --spark-runner SPARK --spark-master yarn
Here are the URLs for the cram, its index, and the reference:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram.crai
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
Does spark-bam handle cram files? If so, how does the reference get specified?