Advice on parsing results and getting allele frequencies

davidecarlson commented 4 years ago

Hi Jeffrey,

Thanks for making TEFLoN available. I've run the pipeline on a group of 12 individuals, and now I'm working on parsing and understanding the results. I'm trying to follow a set of principles that are relatively similar to what you and co-authors did in the 2017 GBE paper.

To call a "present" genotype for a TE insertion, I'm currently requiring:

- 3 or more "presence" reads (column 10) in that sample
- Ratio of "presence" reads to all reads (column 13) must be 0.75 or greater in that sample

To call an "absent" genotype for a TE insertion, I'm requiring:

- 3 or more "absence" reads (column 11) in that sample
- 1 or fewer "presence" reads (column 10) in that sample
- Ratio of "presence" reads to all reads (column 13) must be 0.25 or smaller in that sample
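
As a minimal sketch, the two sets of criteria above could be written as one small Python function. The function name, the argument names, and the use of `None` for "no call" are assumptions made for illustration and are not part of TEFLoN; the ratio is passed in as read from column 13 rather than recomputed.

```python
from typing import Optional

NO_CALL = -9  # value reported in column 13 for an ambiguous genotype


def call_genotype(presence_reads: int, absence_reads: int, ratio: float) -> Optional[str]:
    """Return "present", "absent", or None (no confident call) for one sample.

    presence_reads: column 10, absence_reads: column 11, ratio: column 13.
    """
    if ratio == NO_CALL:
        return None
    # "Present": at least 3 presence reads and a presence ratio of 0.75 or more.
    if presence_reads >= 3 and ratio >= 0.75:
        return "present"
    # "Absent": at least 3 absence reads, at most 1 presence read,
    # and a presence ratio of 0.25 or less.
    if absence_reads >= 3 and presence_reads <= 1 and ratio <= 0.25:
        return "absent"
    return None
```

Anything that meets neither set of criteria is left uncalled instead of being forced into one of the two classes.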

Does the above seem sensible?

Also, ideally I would like to calculate the frequencies of particular TE insertion alleles across all 12 of my samples, but for most TE insertions I have an ambiguous genotype call (column 13 = -9) in one or more samples, which obviously makes calculating the frequency more complicated.

My current thought is to set a threshold for minimum # of samples with an unambiguous genotype call, and then estimate the allele frequency for each TE insertion using the samples that have a called genotype for that TE insertion.
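
A rough sketch of that idea in Python follows. The function name, the `"present"`/`"absent"`/`None` encoding of per-sample calls, and the default of 8 called samples are all assumptions chosen for illustration; it also assumes each sample contributes a single allele to the estimate.

```python
from typing import Optional, Sequence


def insertion_frequency(calls: Sequence[Optional[str]], min_called: int = 8) -> Optional[float]:
    """Estimate the frequency of one TE insertion from per-sample calls.

    calls holds "present", "absent", or None (ambiguous) for each sample.
    Returns None when fewer than min_called samples have a call.
    """
    called = [c for c in calls if c in ("present", "absent")]
    if len(called) < min_called:
        return None
    # Frequency among called samples only; ambiguous samples are ignored.
    return sum(c == "present" for c in called) / len(called)


# Example: 12 samples, 3 of them ambiguous, 6 of the 9 called samples carry the insertion.
example_calls = ["present"] * 6 + ["absent"] * 3 + [None] * 3
example_freq = insertion_frequency(example_calls)
```

Raising `min_called` trades the number of insertions retained against the precision of each frequency estimate.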

Does an approach like this seem reasonable to you? Thanks for any advice! Dave

jradrion commented 4 years ago

Hi Dave,

> Does the above seem sensible?

Yes, I think the criteria you described seem sensible. You could also try comparing different confidence thresholds: for example, in a "high confidence" set of calls, do not allow any presence reads in an individual where the TE is called as absent, and then use more lax requirements for intermediate or lower confidence sets. This might help you convince yourself of any potential patterns you find.
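
One way to set this up, as a sketch only: keep the thresholds in a small table and run the same calling function once per tier. The "intermediate" tier below is Dave's criteria, and the "high" tier differs only in allowing zero presence reads for an absent call. The "low" tier values, along with all class, field, and function names, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    min_presence_reads: int            # presence reads needed for a "present" call
    min_presence_ratio: float          # column 13 value needed for a "present" call
    min_absence_reads: int             # absence reads needed for an "absent" call
    max_presence_reads_if_absent: int  # presence reads tolerated in an "absent" call
    max_absent_ratio: float            # column 13 value tolerated in an "absent" call


TIERS = {
    "high": Thresholds(3, 0.75, 3, 0, 0.25),          # no presence reads when absent
    "intermediate": Thresholds(3, 0.75, 3, 1, 0.25),  # criteria from the question
    "low": Thresholds(2, 0.60, 2, 2, 0.40),           # hypothetical laxer values
}


def call_with_tier(presence: int, absence: int, ratio: float, t: Thresholds) -> Optional[str]:
    """Return "present", "absent", or None under one set of thresholds."""
    if ratio == -9:  # ambiguous genotype in column 13
        return None
    if presence >= t.min_presence_reads and ratio >= t.min_presence_ratio:
        return "present"
    if (absence >= t.min_absence_reads
            and presence <= t.max_presence_reads_if_absent
            and ratio <= t.max_absent_ratio):
        return "absent"
    return None


# One presence read among six reads: called absent only in the laxer tiers.
calls_by_tier = {name: call_with_tier(1, 5, 0.17, t) for name, t in TIERS.items()}
```

A pattern that holds in all three tiers is less likely to be an artifact of where the cutoffs were placed.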

> Does an approach like this seem reasonable to you?

Again, I think this is a reasonable thing to do. I would hesitate to include samples with a -9 in column 13 for anything that you want to have high confidence in, but you could also simply sum up the presence and absence counts across all samples and take the fraction of presence over total counts. I imagine this would give you a reasonable approximation to actually having good genotypes called on the individuals, but I've never actually tested this.
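
In code, the pooled version would look something like the sketch below. The function name and the input layout (one `(presence, absence)` pair per sample, taken from columns 10 and 11) are assumptions, and as noted above the approximation itself is untested.

```python
from typing import Iterable, Optional, Tuple


def pooled_presence_fraction(counts: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Sum presence and absence reads over all samples and return
    presence / (presence + absence), or None if there are no reads at all."""
    total_presence = 0
    total_absence = 0
    for presence, absence in counts:
        total_presence += presence
        total_absence += absence
    total = total_presence + total_absence
    if total == 0:
        return None
    return total_presence / total


# Three samples for one insertion: 5 presence reads out of 10 reads in total.
pooled = pooled_presence_fraction([(4, 0), (0, 5), (1, 0)])
```

Note that this weights each sample by its read depth, so a deeply sequenced individual pulls the estimate toward its own genotype.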

Cheers, Jeff

davidecarlson commented 4 years ago

Hi Jeff, Thanks for the advice. It's very helpful. I'll go ahead and close this. Best, Dave
