rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Slow input file conversion #8

Open oushujun opened 6 years ago

oushujun commented 6 years ago

Hello,

Thank you for developing STITCH. It looks like a very promising program, with better performance for imputing rare variants.

I was trying to use it on a rice population of 3,500 lines with an average coverage of 10X (3-5 Gb per bam file). However, when preparing the input files, it ran for a couple of days and still produced no input files. I tried the two-step strategy mentioned in #2, which, if I understand correctly, prepares the input files one at a time to relieve I/O pressure. Even so, it took 3 days to convert a single bam file, which seems computationally unappealing for my population. Do you have any suggestions to accelerate the input file conversion? Or is STITCH perhaps not a good fit for this dataset?

I was using STITCH/1.4.0 on CentOS 7. Parameters: STITCH.R --chr=Chr1 --bamlist=bamlist.txt --outputdir=./ --K=4 --nCores=50 --expRate=3.868 --Jmax=1 --gridWindowSize=100000 --nGen=540000 --posfile=Chr1.poslist

Thank you very much!

Shujun

rwdavies commented 6 years ago

Hey,

Thanks for the inquiry. I must admit, from my perspective, 10X is pretty high coverage! I have used 10X as truth data on multiple occasions to validate lower coverage imputation. STITCH is probably best suited to lower coverage (<2X) datasets. If you are interested in further refinement of low confidence genotype calls in your 10X data, I might suggest something like IMPUTE2.

More generally, about the input generation, I've tried to make conversion from raw (.bam) to the internal STITCH format as fast as I can by using SeqLib and C++. It doesn't seem right that your conversion is so slow. For your Chr1 tests, how many SNPs are you looking at, and what is the SNP density? I'm wondering whether a very high SNP density might be what's making the conversion slow.
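If it helps, here is a minimal R sketch to get those two numbers from your position file. It assumes the posfile is whitespace-separated with chromosome in the first column and position in the second (adjust the column indices if yours differ), and uses your Chr1.poslist name from above:

```r
# Rough SNP count / density check on a STITCH-style position file.
# Assumes whitespace-separated columns: chr, pos, ... (no header).
pos <- read.table("Chr1.poslist", header = FALSE, stringsAsFactors = FALSE)
n_snps  <- nrow(pos)
span_bp <- max(pos[, 2]) - min(pos[, 2])
cat("SNPs:", n_snps, "\n")
cat("Span (bp):", span_bp, "\n")
cat("Average bp per SNP:", round(span_bp / n_snps, 1), "\n")
```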

Robbie

oushujun commented 6 years ago

Hi Robbie,

Thank you for your input. I haven't tried IMPUTE2 yet, but I am trying Beagle. I haven't got any results yet because of high memory usage on larger regions. I am now running with the default settings (which are tuned for humans) and it is already quite slow - it is estimated to take >10 h on a 50 Kb region. I am sorry this is off topic, but do you know whether IMPUTE2 or any other program can perform faster reference-free imputation?

As for the raw bam files, you are right, the SNP density is quite high in this dataset: even after filtering there are still 2.5M SNPs on Chr1 (40 Mb), roughly one SNP every 20 bp.

I have another question. I learned from #4 that varying sequencing depth can potentially bias results toward high-coverage loci. My situation is that the population not only varies in sequencing depth but also has very strong population structure. The population contains both cultivated rice (10-15X) and wild rice (1-2X). Cultivated rice has two main subpopulations (FST=0.55). Wild rice has much higher nucleotide diversity (up to 5 times higher than one of the cultivated subpopulations), and hence larger Ne and shorter LD. How do these population parameters affect the imputation results in STITCH (or in general)? Is it a bad idea to impute all of these together? Any suggestions are appreciated!

Thanks! Shujun

rwdavies commented 6 years ago

Hey,

1 SNP every 20 bp is pretty dense. I think those run times are to be expected. It's been a while since I've run IMPUTE2 or Beagle on humans, but I seem to remember chunking the genome into ~2000-5000 SNP bins (input sites), with each chunk taking ~10-30 hours to run.
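If you go the chunking route, something along these lines would split a position file into fixed-size SNP chunks and print the region each one covers (again assuming chromosome and position are the first two columns of the posfile; the chunk size of 4000 is just an illustrative value in the range above):

```r
# Split a position file into chunks of ~4000 SNPs and report each chunk's region.
pos <- read.table("Chr1.poslist", header = FALSE, stringsAsFactors = FALSE)
chunk_size <- 4000  # illustrative; anywhere in the ~2000-5000 range mentioned above
chunk_id <- as.integer(ceiling(seq_len(nrow(pos)) / chunk_size))
for (i in unique(chunk_id)) {
    p <- pos[chunk_id == i, 2]
    cat(sprintf("chunk %d: %s:%d-%d (%d SNPs)\n",
                i, pos[1, 1], min(p), max(p), length(p)))
}
```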

There are certainly other pieces of software out there that will run on non-human populations. Many of them, I think, take advantage of either array data or inbred samples, so I'm not sure they're what you're looking for here.

One somewhat crazy option: if you have a lot of related samples from the cultivated subpopulations, your imputation using 10X might not be that much better than with 1X, so you could downsample all samples to <2X and try running STITCH on that. I would do the downsampling with something external like samtools. If you go down this route you may also be able to get away with other speedups, like using fewer than 40 iterations. I would try this first on some small region, like the 50 kbp region you're testing with Beagle / IMPUTE2, to see how it compares for speed / accuracy.
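As a rough sketch of the downsampling step (the 10X and 2X values are just your numbers from above, `sample.bam` is a placeholder file name, and I'm using samtools view -s, whose argument is seed.fraction, i.e. integer part = random seed, decimal part = fraction of reads kept):

```r
# Compute the samtools -s value needed to take a ~10X bam down to ~2X,
# i.e. keep ~20% of reads, then build the command to run.
current_cov <- 10                       # approximate current coverage of this sample
target_cov  <- 2                        # target coverage
seed        <- 42                       # arbitrary seed for reproducibility
frac <- target_cov / current_cov
cmd <- sprintf("samtools view -b -s %.3f sample.bam > sample.2x.bam", seed + frac)
cat(cmd, "\n")                          # prints: samtools view -b -s 42.200 ...
# system(cmd)                           # uncomment to run; requires samtools on PATH
```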

About the parameters for imputation in general, I don't think I've come across a situation as extreme as yours, but in my experience it's been useful to run samples together whenever possible (including for imputation), to minimize any batch-specific artefacts that might arise (this depends on the specific model / program used, but samples imputed together tend to look more like each other than samples imputed separately). If you dig into the method, you'll see what the parameters refer to in the model. For STITCH specifically, the model is of each observed haplotype being a mosaic of a set of ancestral reference haplotypes, which are built internally. In your case, with cultivated rice, I would set nGen to something more like the number of generations since rice cultivation (your wild and cultivated samples could be approximated as having descended from a small number of representative ancestors at that point). I'm going to guess rice was cultivated about 10,000 years ago with 1 generation / year, in which case I'd set nGen more like 10000 x 1. It's not perfect, but it's not an unreasonable simplification.
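To be concrete, something like the following is what I have in mind if you call the STITCH R function directly rather than via STITCH.R (untested sketch; the file names and K are carried over from your command above, and the only substantive change is nGen):

```r
# Hedged sketch: same inputs as the command line above, but with nGen set to
# roughly the number of generations since rice domestication (~10,000 x 1 / year).
library("STITCH")
STITCH(
    chr       = "Chr1",
    bamlist   = "bamlist.txt",
    posfile   = "Chr1.poslist",
    outputdir = "./",
    K         = 4,
    nGen      = 10000,
    nCores    = 50
)
```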

Robbie

oushujun commented 6 years ago

Hi Robbie,

Thank you for your great advice on both the software options and the parameter settings. Just to give you an update on a recent run: one sample (33X, a 13 GB bam against a 0.4 Gb genome) has been running input file conversion for 12 days and is still going. I don't intend to run the project this way, I'm just curious how long it could take. I notice the task is using only a single CPU - would it be possible to parallelize this step (i.e., convert the bam files one by one, each using all of the provided CPUs), so that the I/O pressure goes away?

Best, Shujun