slowkoni / rfmix

RFMIX - Local Ancestry and Admixture Inference Version 2

Ideal reference variant count and genetic map to speed up analysis. #47

Closed: nick338 closed this issue 11 months ago

nick338 commented 1 year ago

Hi,

I have been able to run rfmix successfully; however, the runtime and output file size for my current sample set are quite large. Are there any options in rfmix I can set to improve the speed of the analysis and reduce the file size?

I am attempting to perform local ancestry inference on a population split into 5 ancestral groups. Would reducing the number of groups improve speed? Would performance also improve if I filtered out variants from either or both of my reference VCF and sample VCF? Lastly, do you have any recommendations for an ideal genetic map file to use with rfmix (hg38)? Currently I have tried genetic maps from: https://www.science.org/doi/10.1126/science.aau1043, and https://alkesgroup.broadinstitute.org/Eagle/downloads/tables/

Thank you for any help you can provide,

Nick

samreenzafer commented 11 months ago

Hi, wondering if you ever got a response to your questions on performance and on hg38?

slowkoni commented 11 months ago

As long as your reference panel and sample panel use the same coordinate system, either GRCh37 (hg19) or GRCh38 (hg38) in this case, it shouldn't matter much which one you use; just use whichever is most convenient for you. In practice that is determined by your reference panel and its source: either your samples are already on the same genome build version, or you'll need to lift your samples over to the reference panel's coordinates. I would not generally recommend the reverse, since the reference panel is usually the better-curated dataset, though that depends on where you sourced it or whether you generated it yourself.

How to convert a VCF to a different genome build coordinate system is beyond the scope of anything here, but it is a very common task and there is plenty of ink spilled on it. Last I knew, the most commonly used tool was called "liftOver", and the requisite chain files can be found at UCSC, certainly for human, but also for many other well-studied species that have multiple genome build versions.
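For VCFs specifically, liftOver itself operates on interval files, so a common route is a VCF-aware wrapper such as Picard's LiftoverVcf together with a UCSC chain file. A rough sketch, with placeholder file names, would look like:

# fetch the UCSC chain file for hg19 -> hg38
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

# lift the sample VCF to hg38 coordinates; records that fail to lift
# are collected in the REJECT file rather than silently dropped
picard LiftoverVcf \
    I=samples.hg19.vcf.gz \
    O=samples.hg38.vcf.gz \
    CHAIN=hg19ToHg38.over.chain.gz \
    REJECT=rejected.vcf.gz \
    R=hg38.fa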

As for performance, rfmix v2.0 is optimized for high CPU core count systems using shared-memory multithreading. If wall-clock end-to-end time is frustrating when trying different parameter settings, or when tweaking the reference panel composition or annotation, I would recommend using AWS and launching a very high vCPU system as a spot instance. How to use AWS or other IaaS cloud providers is beyond the scope of anything here, but the recommended way to improve end-to-end time is to use a high core count system and verify, with the common Unix process-monitoring utility top, that rfmix is using all cores. It should by default; if not, specify --n-threads=<# of cpus>. That option is intended more as a way to reduce the greediness of rfmix, because by default it will try to use everything it can find; if it isn't doing that, it either isn't finding everything or the system has fewer cores than you think. Performance should scale roughly linearly with the number of CPU cores available, but this has not been tested or proven beyond maybe 32 cores; I'm not sure that was ever thoroughly investigated.
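For example, to make the thread count explicit (file names here are placeholders):

# match rfmix's worker count to the machine; this is its default behavior anyway
rfmix -f query.vcf.gz -r reference.vcf.gz -m sample_map.tsv \
    -g genetic_map.txt --chromosome=22 -o output_prefix \
    --n-threads=$(nproc)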

Another option to improve wall-clock end-to-end time during trial runs, while optimizing parameters and panel composition, is simply to run fewer samples in your sample set, or to thin the data by dropping some fraction of SNPs from the VCF at random. If you thinned by 50% (that's aggressive, but it depends on the density you already have), I would expect it to run twice as fast. It might run much faster than that, even. It's been a very long time since I've tried it.
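A minimal way to thin at random with standard tools (the 50% fraction and the seed are illustrative; plink's --thin does the same job):

# keep all header lines, keep each variant with probability 0.5
zcat query.vcf.gz \
    | awk 'BEGIN{srand(42)} /^#/ {print; next} rand() < 0.5' \
    | bgzip > query.thinned.vcf.gz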

The last thing to mention, which is of paramount importance for runtime, is the amount of missing data. Missing data will really slow it down; I'd say it becomes intractable with only 10% missing. The easiest fix is to impute the missing genotypes, and since you've probably already done this along with phasing, as they are closely related problems, this is probably not your main problem. If you do have missing data in your samples, I'd be interested to know more. rfmix v1.0 could not handle missing data at all, not even a single missing genotype, so handling missing data is new in rfmix 2.0.
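If you want to measure how much missingness you actually have before deciding, something along these lines works with a recent bcftools (the 10% threshold mirrors the figure above):

# annotate per-site missingness, then count sites with >10% missing genotypes
bcftools +fill-tags query.vcf.gz -Ou -- -t F_MISSING \
    | bcftools view -H -i 'F_MISSING > 0.1' - \
    | wc -l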

samreenzafer commented 11 months ago

Thank you. One follow-up question: since you mentioned "run fewer samples in your sample set", does this mean we can run RFMIXv2 on smaller subsets of the VCF and then merge the rfmix outputs with simple scripting? If the local ancestry inference is done independently on every sample of the query VCF, this could be a good way to parallelize the step. I've already been using --n-threads 14, and my imputed VCF has ~400K SNPs and 10K samples on chromosome 22 alone.
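For example, something like this is what I have in mind, assuming the inference really is independent per sample, which is the open question (the batch size and file names are arbitrary):

# split the query sample IDs into batches of 1000, then deconvolute each batch
bcftools query -l query.vcf.gz | split -l 1000 - batch.
for b in batch.*; do
    bcftools view -S $b query.vcf.gz -Oz -o query.$b.vcf.gz
    # each iteration could instead be submitted as a separate cluster job
    rfmix -f query.$b.vcf.gz -r ref.vcf.gz -m sample_map.tsv \
        -g genetic_map.txt --chromosome=22 -o out.$b --n-threads=14
done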

samreenzafer commented 11 months ago

Also, I did have a lot of missingness in the VCF, which rfmix couldn't handle, so I used "bcftools +missing2ref" to fix those genotypes.

My project has 10K samples that were genotyped in three batches on three similar, but not identical, Mega chips (hg19). I QC'd them separately, then phased each with SHAPEIT (using the 1000G reference panel), then imputed each separately on the Michigan Imputation Server against the 1000G Phase 3 30X (hg38 beta) reference. This yields phased, imputed VCFs on the hg38 build, which I QC'd to exclude SNPs with imputation score < 0.6.
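The imputation-score filter I applied was along these lines (assuming the minimac-style R2 INFO tag that the Michigan server writes; other imputation tools use different tag names, e.g. Beagle's DR2):

bcftools view -i 'INFO/R2 >= 0.6' imputed.vcf.gz -Oz -o imputed.qc.vcf.gz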

I then merged the 3 imputed cohorts before running RFMIXv2:

bcftools merge --threads $threads -O b $imputedfile1 $imputedfile2 $imputedfile3 \
    | bcftools view -i 'F_MISSING < 0.05' -O b \
    | bcftools +missing2ref -O z -o $output.vcf.gz -- -p -m

rfmix -f $output.vcf.gz \
    -r 1000GP_hg38/ALL.chr$i.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
    -m 1000G_27022019.GRCh38.sample.global_ancestry.tsv \
    -g 1000G_REF_hg38/genetic_map_hg38_chr$i.txt \
    --chromosome=$i \
    -o $input.deconvoluted \
    --n-threads=$threads

In chr22:

SNPs in each imputed cohort: 400,359 / 382,692 / 417,207
SNPs after merging and filtering by F_MISSING: 340,905

The reason I merged the cohorts was that I didn't know whether the rfmix outputs could be merged SNP-by-SNP. Just wanted to check with you whether this is a sound approach, in your opinion.