Closed zhou-sumei closed 3 months ago
Hello, happy to help. I've looked over everything using the data you sent and was able to reproduce the error. While the error handling should be better in this case, this is happening because the sequences in your alignments are too similar. In two alignments, there are no phylogenetically informative sites, with most sites having no differences at all. In the other alignment (file name ending in 96), there are 3 informative sites, but there are also many invariant sites. This means that, for almost every site in your alignment, none of the sampled quartets is informative or decisive, so when it starts calculating concordance factors it runs into the division by 0 error when it reaches one of those sites.
Unfortunately, this means these alignments aren't suitable for a concordance factor analysis, and you probably won't learn much by running them through PhyloAcc anyway. With so little variation, I doubt you will infer accelerations on any lineage. Indeed, when I run without the adaptive mode, two loci don't even get passed to PhyloAcc because they lack informative sites:
# INFO: 2 loci have 0 informative sites and will be removed from the analysis.
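For a rough sense of what "informative" means here, the sketch below counts parsimony-informative columns in a small alignment and shows a guarded division of the kind the error handling needs. It is only an illustration under stated assumptions: it uses the textbook parsimony-informative definition (at least two states, each present in at least two sequences), which may not match PhyloAcc's exact criterion, and the function names and the set of missing-data characters are hypothetical.

```python
from collections import Counter

# Characters treated as missing data in this sketch (an assumption, not PhyloAcc's rule).
MISSING = set("-?NnXx")


def count_informative_sites(seqs):
    """Count parsimony-informative columns in a list of equal-length aligned sequences."""
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    informative = 0
    for col in zip(*seqs):
        counts = Counter(c.upper() for c in col if c not in MISSING)
        # Informative: at least two states, each seen in at least two sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def safe_fraction(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


variable = ["ACGTAC", "ACGTAC", "ACTTAT", "ACTTAT"]
identical = ["ACGT"] * 4
n_variable = count_informative_sites(variable)    # columns 3 and 6 qualify
n_identical = count_informative_sites(identical)  # no column qualifies
```

With zero decisive sites, `safe_fraction(0, 0)` returns `None` instead of raising `ZeroDivisionError`, which is one simple way to flag a locus as unusable rather than crash.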
I will update the error handling for this case, so I'll keep this issue open until that is resolved. Please let me know if you have any other questions!
Thank you for your quick reply!
I got these CNEEs with phastCons, following their manual (based on 4d sites and the first codon positions). After removing the exonic regions, all the results were used for the PhyloAcc analysis without any other filters. I have run PhyloAcc on the whole-genome data successfully, but to get more accurate results I am now trying to run PhyloAcc-GT in batch.
Here I have another question. Thanks to your prompt message, I checked my alignments and noticed that some have only 2-3 segregating sites in ~60 bp. Should I filter out this kind of alignment (too few segregating sites) for both the PhyloAcc and PhyloAcc-GT analyses, since they are unlikely to be accelerated elements? If that makes sense, what threshold should I set? Filtering out such low-variation alignments would also reduce running time (I have 1 million CNEEs!), which would be great!
Looking forward to using the "-r adaptive" mode soon, and thank you for your efforts!
Yes, it could definitely help to filter elements with few informative sites. phyloacc.py by default removes loci without any informative sites, but you could go beyond that and filter out other loci. Unfortunately, there isn't really a set rule for how many or what percentage of sites should be informative to yield accurate results. Any filtering could lead to false negatives. You could take a look at the aln-stats.csv file in your phyloacc.py output directory, which should tell you how many informative sites are in each alignment. This could help you get an idea for how many loci would be removed with various cutoffs for informative sites.
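One way to explore cutoffs is to tabulate how many loci fall below each candidate threshold. The sketch below is a minimal example, not part of PhyloAcc: the column name passed to `read_int_column` is an assumption, so check the actual header of your aln-stats.csv, and the reader assumes a plain comma-separated file with a single header row.

```python
import csv


def read_int_column(csv_path, column):
    """Read one integer column from a CSV file that has a header row."""
    with open(csv_path, newline="") as handle:
        return [int(row[column]) for row in csv.DictReader(handle)]


def loci_removed_by_cutoff(values, cutoffs):
    """Map each cutoff to the number of loci with fewer informative sites than it."""
    return {cutoff: sum(1 for v in values if v < cutoff) for cutoff in cutoffs}


# Made-up counts for illustration. With real data you might instead call:
# counts = read_int_column("aln-stats.csv", "informative_sites")  # hypothetical column name
counts = [0, 0, 3, 5, 12]
removed = loci_removed_by_cutoff(counts, [1, 2, 5, 10])
for cutoff, n in removed.items():
    print(f"cutoff {cutoff}: {n} of {len(counts)} loci removed")
```

A cutoff of 1 corresponds to the default behaviour described above (dropping loci with no informative sites); higher cutoffs trade running time against the risk of false negatives.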
Beyond that, if you wanted to estimate the false negative rate from filtering loci with few informative sites, you'd have to do some filtering, then run the filtered elements anyway and see how many you infer accelerations in. This kind of defeats the purpose of the filtering, but could be useful to do on a small number of the filtered loci. Either way, this would take more time.
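The spot check described above could be set up roughly as follows: draw a reproducible random subset of the filtered loci, run those through PhyloAcc as usual, and report the fraction in which an acceleration is inferred. This is only a sketch. The function names and locus IDs are made up, and the fraction is a rough proxy for how many accelerated elements the filter discards, not a formal false negative rate.

```python
import random


def sample_filtered_loci(locus_ids, n, seed=0):
    """Pick up to n filtered loci at random, reproducibly for a given seed."""
    ids = sorted(locus_ids)  # sort so the result does not depend on input order
    rng = random.Random(seed)
    return rng.sample(ids, min(n, len(ids)))


def fraction_accelerated(n_accelerated, n_tested):
    """Fraction of spot-checked loci with an inferred acceleration, or None if none were tested."""
    return n_accelerated / n_tested if n_tested else None


# Hypothetical IDs of loci that a cutoff would have removed.
filtered = [f"locus{i}" for i in range(1000)]
subset = sample_filtered_loci(filtered, 50, seed=42)
```

After running only `subset`, something like `fraction_accelerated(2, len(subset))` gives a feel for how costly the filter is without rerunning every filtered locus.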
Hopefully that helps!
Thank you for your advice!
Hi, authors,
I'm trying to use this nice software to analyze my data. I hit an error today when I tried to use the gene tree model with -r adaptive. Here is part of the error info:
It seems this problem happens when calculating sCF to pick which elements to run with the gene tree model. I'm not sure whether it is related to the gaps in my alignment. How does PhyloAcc treat gaps in the sequences?
When I use "-r gt", it runs fine. But with "-r adaptive", whether I use "-l " or "--theta", this error occurs.
Here is the command I used:
I have also sent my data to your email. I hope it provides more info. Thanks a lot!