rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
77 stars 17 forks source link

question about input format #17

Open EclipseCN opened 5 years ago

EclipseCN commented 5 years ago

Hi,

I am learning gene imputation recently. And i am interested with your method "STITCH", now i want to run STITCH with .vcf format (the example is with .bam format)

Could u tell me how to do this ? I didn't find related info in the manual. Thank you very much ! :)

rwdavies commented 5 years ago

STITCH doesn't support running on VCF input, only BAM input (and a list of sites to impute). The reason is that low-coverage imputation accuracy is better if you treat the reads as independent events (which they are) versus the per-SNP genotype likelihoods.

EclipseCN commented 5 years ago

Thanks for your apply !

I understand why we start with BAM input now. I have another question about "STITCH_human_example" mentioned in examples.R, after i downloaded and decompressed it, I just see many .RData files in input folder and I guess these files may contain same info as .BAM files. So is it possible to convert these .RData files to .BAM files ? (my purpose is to do human imputation analysis as what I did in mouse examples)

Thank you !

rwdavies commented 5 years ago

When STITCH runs, it converts input from the .bam files into an internal format, saved in .RData files. STITCH can then re-run using just those RData files, which is faster as it avoids re-conversion, and are smaller to store as they contain just the information necessary for imputation. It is generally not possible to convert .RData files back into .bam files as they have lost much of the information contained in the original bam. Surely if you're doing human imputation you already have your own set of bams to work from?

EclipseCN commented 5 years ago

Thanks to your help, now I am clear about the example use of STITCH.

Because I am quite new to this area, and I want to compare STITCH performance with other softwares which start with .vcf files, so now I am trying to compare based on identical dataset (first I convert .bam to .vcf, then for .vcf I do analysis with software like Beagle and for .bam I do analysis with STITCH).

In your paper, you mentioned that for CONVERGE dataset, you use GATK to call variants, then "Genotype likelihoods" from VCF files were used as Beagle's inputs. My question is: Is the info "Genotype likelihoods" (i.e. \<GL> field) in .vcf file produced directly by GATK ?(I can only get \<PL> field from it)

It is very kind of you if you can give me some advice about it.

rwdavies commented 5 years ago

I don't have any GATK VCFs to hand, but yeah, GL and PL encode the same information up to rounding (http://samtools.github.io/hts-specs/VCFv4.1.pdf). PL is GL but stored as Phred scaled integers, vs floating point numbers. I can't remember but Beagle must be able to take one or the other, probably PLs then. PLs are lossy but the amount of information lost is tiny. In the past, this is approximately how I've run GATK to get input for Beagle (https://github.com/rwdavies/lcimpbench/blob/master/snakefiles/Snakefile_make_genos#L98)

Small other note for running GATK -> Beagle, if your data is very low coverage, GATK may omit entries (i.e. encode them as just "./." without PL or GL) where there are no reads, and Beagle will complain. So in the past I've had to do something like this https://github.com/rwdavies/lcimpbench/blob/master/scripts/make_vcf_acceptable_for_beagle.sh although note here this is for running a single sample through (it changes only a single column, whereas for running with multiple samples, you'll need something more complex).

One other small note, STITCH has a function outputInputInVCFFormat, so if you run STITCH with that, including on previously generated input (regenerateInput = FALSE), it should make a VCF acceptable for Beagle, although I haven't used that option in a long time, and am not 100% sure it still works properly.

EclipseCN commented 5 years ago

Thank you very much for your snakemake scripts. It's really helpful to me

Now I prepare my BAM input and a pos.txt contain position info like STITCH_example_2016_05_10, but I get error, see below:

library(STITCH) Loading required package: parallel Loading required package: rrbgen STITCH(tempdir=tempdir(),chr="chr22",bamlist="bamlist.txt",posfile="pos.txt",o[2019-05-30 20:58:09] Running STITCH(chr = chr22, nGen = 100, posfile = pos.txt, K = 4, outputdir = /home/disk/fyh/lab_other_work/Imputation/test_0516/STITCH_test/, tempdir = /tmp/Rtmpf7hE5a, bamlist = bamlist.txt, cramlist = , sampleNames_file = , reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.4, initial_max_hapProb = 0.6, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE) [2019-05-30 20:58:09] Program start [2019-05-30 20:58:09] Get and validate pos and gen [2019-05-30 20:58:09] Done get and validate pos and gen [2019-05-30 20:58:09] Get BAM sample names [2019-05-30 20:58:09] Done getting BAM sample names [2019-05-30 20:58:09] Generate inputs Error in `/home/zhangfeng/bin/lib64/R/bin/exec/R': free(): invalid pointer: 0x000000000359ccc8

some unnecessary info in R

Aborted (core dumped)

I don't know what's wrong with my BAM file, BAM file is produced by:

  1. download from 1000G project (for example: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00097/exome_alignment/HG00097.mapped.ILLUMINA.bwa.GBR.exome.20130415.bam)
  2. then I just extract region and make index
  3. I run 5 samples (HG00096-HG00101) with STITCH

And I can run successfully with STITCH_example_2016_05_10 so I don't know why this happened.

Thank you !

rwdavies commented 5 years ago

Huh, that's a new one. I'm wondering if this is an out of RAM problem? What's the coverage? If this is 5 Mbp at high coverage it could run into trouble maybe (though I'd still be surprised). Can you try running those BAMs using a smaller region (setting regionStart, regionEnd and buffer, say 10 kbp, then 100 kbp) and see if it runs through?

Otherwise, can you try re-installing Rcpp and RcppArmadillo, and see if that helps (otherwise you can try the bioconda version?)

EclipseCN commented 5 years ago

I use samtools view -b input.bam 22:20000000-20010000 > output.bam to extract region

and use below to calculate coverage input.samtools stats input.bam | grep "bases mapped (cigar):" | cut -f3 high_cov=totalbases /REGION_LENGTH | bc -l

it is about 0.3-2.5X

It must be something wrong with my bam files, I will try other bam files again, thank you!