hammerlab / spark-bam

Load genomic BAM files using Apache Spark
http://www.hammerlab.org/spark-bam/
Apache License 2.0

Cram files? #15

Open jjfarrell opened 6 years ago

jjfarrell commented 6 years ago

Does spark-bam handle cram files? If so, how does the reference get specified?

ryan-williams commented 6 years ago

I went through the motions of passing .cram loading through to hadoop-bam, but haven't tested it! You'd just call sc.loadReads, as you would with a .bam.

IIUC, you'd specify relevant options like the reference path the same way you do in hadoop-bam, e.g. as a property on the Hadoop Configuration (i.e. SparkContext.hadoopConfiguration).
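A rough, untested sketch of what that might look like. It assumes spark-bam and hadoop-bam are on the classpath, and that the reference goes in under the key `hadoopbam.cram.reference-source-path` (as far as I recall, the value of hadoop-bam's `CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY`; worth checking against your hadoop-bam version). `CramSketch`, `loadCram`, and both paths are made-up placeholders:

```scala
import org.apache.spark.SparkContext
import spark_bam._, hammerlab.path._

object CramSketch {
  // Hypothetical helper, not part of spark-bam.
  def loadCram(sc: SparkContext, cramPath: String, referencePath: String) = {
    // Assumed hadoop-bam property key for the CRAM reference; verify it
    // against the hadoop-bam version you are running.
    sc.hadoopConfiguration.set("hadoopbam.cram.reference-source-path", referencePath)

    // Same call as for a .bam; the .cram path through spark-bam is untested.
    sc.loadReads(Path(cramPath))
  }
}
```

From a spark-shell with an existing `sc`, that would be called as `CramSketch.loadCram(sc, "/path/to/sample.cram", "file:///path/to/reference.fa")`.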

Making those properties proper method params, to be more idiomatic Scala, would be nice.

Feel free to post the results of trying it!

Alternatively, your application code can decide to call hadoop-bam or spark-bam based on the file's extension 🙁
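The extension check could be as small as the sketch below. All of the names in it (`LoaderChoice`, `Loader`, `SparkBam`, `HadoopBam`, `chooseLoader`) are hypothetical, and it only picks which library to use; the actual loading calls are left to your application:

```scala
object LoaderChoice {
  // Hypothetical tags for the two libraries; neither library defines these.
  sealed trait Loader
  case object SparkBam  extends Loader
  case object HadoopBam extends Loader

  // Route .cram files to hadoop-bam and everything else to spark-bam.
  // The comparison is case-insensitive, so "X.CRAM" also matches.
  def chooseLoader(path: String): Loader =
    if (path.toLowerCase.endsWith(".cram")) HadoopBam else SparkBam
}
```

The application would then match on the result, calling hadoop-bam for `HadoopBam` and `sc.loadReads` for `SparkBam`.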

ryan-williams commented 6 years ago

Hey @jjfarrell, I looked through the related posts https://github.com/broadinstitute/gatk/issues/4506 and https://github.com/HadoopGenomics/Hadoop-BAM/issues/196#issuecomment-373769717 and am curious to dig in a little bit.

Is there a public .cram file you can point me at? I couldn't tell whether your adni/cram/ADNI_002_S_0413.hg38.realign.bqsr.cram is available anywhere.

jjfarrell commented 6 years ago

@ryan-williams That cram is not available. However, I am working on getting a cram of a GIAB sample available for testing.

jjfarrell commented 6 years ago

@ryan-williams

Here is a publicly available cram from 1000 Genomes. Again, I found the Spark GATK v4.0.2.1 job was quite slow processing this cram.

gatk FlagStatSpark --input 1000g/cram/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram --reference file:///restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa -- --spark-runner SPARK --spark-master yarn

Here are the URLs for the cram, its .crai index, and the reference FASTA:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/CHS/HG00419/high_cov_alignment/HG00419.alt_bwamem_GRCh38DH.20150917.CHS.high_coverage.cram.crai

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa