slowkoni / rfmix

RFMIX - Local Ancestry and Admixture Inference Version 2
72 stars 25 forks source link

Stuck at generating simulation samples #5

Open antmarge opened 6 years ago

antmarge commented 6 years ago

When I run RFMix, it seems to get to the "generating simulation samples" stage within a few minutes, but stay there for 30+ hours. I've been running on 20 threads, 120GB with just chr 21 (~120k snps), a reference panel of 19 samples (3 continental populations), and 1 query sample. I've been trying with various sizes of inputs, computational resources, number of generations (default to 50), and EM parameters, but always get "stuck" at this stage. Any ideas?

Loading genetic map for chromosome 21 ...  done
Mapping samples ... 19 samples combined
Scanning input VCFs for common SNPs on chromosome 21 ...   100160 SNPs
Loading haplotypes... done
Defining and initializing conditional random field...
   setting up CRF points and random forest windows...
   computing random forest window spacing overlay...
   initializing apriori reference subpop across CRF...
   setting up random forest probability estimation arrays... done
Defining and initializing conditional random field...   done
1240959 (32.6%) variant alleles 0 (0.0%) missing alleles

Generating internal simulation samples...
slowkoni commented 6 years ago

You need more reference samples. Just double what you have by making a copy of each individual.

Alternatively, you can bypass this step by adding the option --crf-weight=3 (Please note the double dash for this option, autocorrect in these emails likes to switch it to a long dash single character that won’t cut and paste correctly)

The simulation step is actually to determine the CRF weighting parameter at runtime, which depends highly on the density of your data and also the settings you use, and I sort of just pulled 3 out of my backside here. Without the simulation, its as good a place to pull it from as any. But it won’t go horribly wrong if that parameter suboptimal or even quite far off from what it should be. If the output looks too “choppy”, then lower it. If it seems like it isn’t detecting introgressions from other populations of lengths that you would expect, increase it, and also check the number of generations since admixture parameter which defaults to 8 if not set otherwise. I may have that backwards, but you’ll be able to see whether it makes much difference or not.

If you can increase your reference panel, or if you try the trick of just including every sample you do have twice (technically this adds no information, but as long as every sample is doubled, not just some of them, it does not introduce any bias or increased variance), and the program can complete the simulation step, it will then give you estimates of accuracy on your data, subject to the number of generations since admixture being correctly set. The CRF weight and number of generations sort of affect the same thing, but one has a biological interpretation and presumably one would have a principled reason for why they think its value is whatever they are setting it to, and the other is just something the math requires but the only way we can figure out its value is by simulation and empirically optimizing it. There is no principled way of doing it.

Best, Koni

On Jun 12, 2018, at 9:52 AM, Margaret Antonio notifications@github.com wrote:

When I run RFMix, it seems to get to the "generating simulation samples" stage within a few minutes, but stay there for 30+ hours. I've been running on 20 threads, 120GB with just chr 21 (~120k snps), a reference panel of 19 samples (3 continental populations), and 1 query sample. I've been trying with various sizes of inputs, computational resources, number of generations (default to 50), and EM parameters, but always get "stuck" at this stage. Any ideas?

Loading genetic map for chromosome 21 ... done Mapping samples ... 19 samples combined Scanning input VCFs for common SNPs on chromosome 21 ... 100160 SNPs Loading haplotypes... done Defining and initializing conditional random field... setting up CRF points and random forest windows... computing random forest window spacing overlay... initializing apriori reference subpop across CRF... setting up random forest probability estimation arrays... done Defining and initializing conditional random field... done 1240959 (32.6%) variant alleles 0 (0.0%) missing alleles

Generating internal simulation samples... — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/slowkoni/rfmix/issues/5, or mute the thread https://github.com/notifications/unsubscribe-auth/AJapKQvSfgOBtUGHwAlgLsHOuH3lqXe_ks5t7_HDgaJpZM4Uky-y.

antmarge commented 6 years ago

Thanks for the advice! Increasing reference sample number or using the --crf-weight parameter worked. I had used too many samples (~500 per population) and then too few (~6 per population) in an effort to troubleshoot. Indeed, as you suggest, holding --generations constant, the results are increasingly less "choppy" (i.e. many tiny ancestry blocks) as I increase --crf-weight from 5 to 15 to 30 to optimized with simulation.

antmarge commented 6 years ago

Is there a way (e.g. by tuning other parameters) to have more flexibility with the --generations parameter? I expect more than one introgression event in my data (at least one >50 generations past and one <6 generations past). Do these need to be run and interpreted separately?

slowkoni commented 6 years ago

If introgression from the same donor population into the same receiving population, and no suspicion of back introgression, then just set the generations parameter high. If you potentially have back introgression from recipient into donor, like we do with rice and it’s wild relative, then the model is really not sophisticated enough and it’s going to be difficult to ever distinguish the two or three events definitively, but in this case I’d set the generations parameter to something in the middle of the times you suspect the two forward introgressions. Your mileage may vary, sort of interested to hear how this works out for you.

Wishing for the best, Koni

-- Mark Koni Wright, Ph. D. NOTE: This account used to trap spam from sites that require an email login.

Use slowkoni at icloud dot com as my personal account.

Use markoni.wright at gmail dot com for non-Stanford professional or consulting business email

Use mhwright at stanford dot edu for Stanford or Cornell email. mhw6 at cornell dot edu will work for life but Stanford spam filters it.

Sent from my iPhone

On Jun 14, 2018, at 10:28, Margaret Antonio notifications@github.com wrote:

Is there a way (e.g. by tuning other parameters) to have more flexibility with the --generations parameter? I expect more than one introgression event in my data (at least one >50 generations past and one <6 generations past). Do these need to be run and interpreted separately?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

hguturu commented 6 years ago

We seem to be having this problem as well. The runs get stuck at generating simulation samples though there are more than 100 samples per population. But, when we tried setting the --crf-weight from 1 to even 30, all returned as 'Initial analysis - logl -inf’. Any thoughts on the the problem here?

Does the input use reference/alternative allele? Is it ok to randomly assign an allele to reference or alternative to fit VCF format?

slowkoni commented 6 years ago

If at all possible (or, a complete or suitable high quality reference genome sequence exists) and so a reference allele is well defined, the VCF format specification should be followed and the 4th column always the allele found in the reference sequence, and the alternate allele in the 5th column, and then in phased genotypes appearing following the 9th column, these alleles are referred to as 0 and 1 respectively in the GT fields.

That said, technically it should be the case that it is only required that both the query VCF file and the reference populations VCF file are absolutely consistent in which allele is assigned as reference, but if which is truly reference is unknown, it doesn’t matter, as long as both files assign alleles the same way for all markers - and the 0 and 1 allele codes in the genotypes absolutely always correspond correctly to the indicated alleles defined in 4 and 5. RFMIX won’t detect allele flips or strand flips for you and correct them on the fly. I suppose it should sniff for evidence of them though and warn or stop if allele flips or stranding issues are detected.

If you are doing it randomly and inconsistently between files, well it should be no surprise at all that the program can’t figure out which segments of the query samples correspond to which of the reference populations - you might as well be trying to match up Tibetan text in romanized form (Wylie spelling for instance) with each of English, French, and German, all of which are word for word transliterations of each other but confirming to correct grammar for the respective language, and then determine which of the those 3 reference languages the Tibetan text is derived from. Answer - none, because Tibetan is not related at all to any of those languages.

Before we move any further of this please check the consistency of allele assignments to reference and alternative, and also that 0 and 1 codes in the genotypes correspond - or, if you need to flip allele order for some markers to have the files correspond, you also need to flip the 0s and 1s in the genotypes when you do that.

If this indeed a problem here, then I’m surprised you didn’t run into serious problems when phasing the data - the input to RFMIX must be phased-resolved first, for both reference and the query inputs. It works on haplotypes, not on diploid genotypes.

-- Mark Koni Wright, Ph. D.

Sent from my iPhone

On Jul 22, 2018, at 08:33, hguturu notifications@github.com wrote:

We seem to be having this problem as well. The runs get stuck at generating simulation samples though there are more than 100 samples per population. But, when we tried setting the --crf-weight from 1 to even 30, all returned as 'Initial analysis - logl -inf’. Any thoughts on the the problem here?

Does the input use reference/alternative allele? Is it ok to randomly assign an allele to reference or alternative to fit VCF format?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

hguturu commented 6 years ago

Thanks for the response. We were matching reference and alternative allele between training and testing and it's phased data. Upon closer inspection, we released the issue was due to having a malformed genetic map file.

JasonTan-code commented 2 years ago

I had the same issue in the last few days. Here is one possible reason: the genetic map file has repeated telomere cM positions. One example is:

22 16050408 0 22 16050612 0 22 16050678 0 22 16050984 0 22 16051107 0 22 16051249 0 22 16051347 0 22 16051453 0.000858258064516 22 16051477 0.00105258064516 22 16051480 0.00107687096774 22 16051497 0.00121451612903

If that is the case, remove 0s and duplicated numbers on the third column.