phyloacc_gt with "-r adaptive" encounter : ZeroDivisionError: division by zero

zhou-sumei commented 3 months ago

Hi, authors,

I'm trying to use this nice software to analysis my data , I met an error today when I tred to use the gene tree model with -r adaptive， here is part of the error info:

04.04.2024 19:09:51 Reading species/branch groups Success 0.5642 0.00021 54.4375
1554.45703
04.04.2024 19:09:51 Getting FASTA files in input dir Success: 3 FASTA files found 0.56781 0.00136 54.4375
1554.45703
04.04.2024 19:09:51 Detecting compression of seq files Success: No compression detected 0.57169 0.00179 54.4375
1554.45703
04.04.2024 19:09:51 Reading input FASTA files Success: 3 files read 0.57715 0.00346 54.4375
1554.45703
04.04.2024 19:09:51 Calculating alignment stats Success: 3 alignments processed 0.61258 0.0328 54.59766
1554.45703
INFO: 2 loci have 0 informative sites and will be removed from the analysis. 04.04.2024 19:09:51 Sampling quartets Success 0.61841 0.00109 54.65625
1555.45703
04.04.2024 19:09:51 Calculating per-locus sCF Processed 0 / 3 alns...
04.04.2024 19:09:53 Calculating per-locus sCF Success 1.91905 1.29848 55.68359
1539.44922
04.04.2024 19:09:53 Averaging site counts In progress... Traceback (most recent call last): File "/picb/evolgen/users/zhousumei/miniconda/envs/phyloacc/bin/phyloacc.py", line 152, in globs = CF.scf(globs); File "/picb/evolgen/users/zhousumei/miniconda/envs/phyloacc/lib/python3.10/site-packages/phyloacc_lib/cf.py", line 367, in scf node_scfs.append(quartet_counts[node][q]['concordant-sites'] / quartet_counts[node][q]['decisive-sites']); ZeroDivisionError: division by zero

it seems like this problem happened when calculating sCF to pick elements run the gene model, I'm not sure whether this problem is related to the gap in my alignment? and how does phyloacc treats gap in the sequence.

When I use “-r gt”, it will be ok. But with “-r adaptive”, no matter with "-l " or with "--theta", there will be this error.

Here is the command I used:

phyloacc.py -d part_CNEE_fa/ -o Diplo_phyloaccGT_adaptive -m test.mod --theta -iqtree-path "/mypath/bin/iqtree" -coal-path "java -Xmx8g -jar /mypath/ASTRAL/Astral/astral.5.7.8.jar" -r adaptive --overwrite -t "DipaorRef;Dipxin;Dipdao;Dipiad" -nodes 1 -batch 100 -j 2 -p 20 -part "hpc" -time 100

And I have send my data to your email, hope it will provide more info, thanks a lot!

gwct commented 3 months ago

Hello, Happy to help. I've looked over everything using the data you sent and was able to reproduce the error. While the error handling should be better in this case, this is happening because the sequences in your alignments are too similar. In two alignments, there are no phylogenetically informative sites with most sites having no differences at all. In the other alignment (file name ending in 96), there are 3 informative sites, however there are also many invariant sites. This means among all the sampled quartets, it can't find any informative or decisive sites for almost all the sites in your alignment, so when it starts calculating concordance factors it runs into the division by 0 error when it reaches one of those sites.

Unfortunately, this means these alignments aren't suitable for a concordance factor analysis, and you also probably won't learn much by running them through PhyloAcc anyways -- since there is such little variation I doubt you will infer any accelerations on any lineage. Indeed, when I run without the adaptive mode, two loci don't even get passed to PhyloAcc for their lack of informative sites:

# INFO: 2 loci have 0 informative sites and will be removed from the analysis.

I will update the error handling for this case, so I'll keep this issue open until that is resolved. Please let me know if you have any other questions!

zhou-sumei commented 3 months ago

Thank you for your quick reply!

I got these CNEE by phastCons follow their manual (based 4d-sites and the first sites of codon ) all the results after remove the exon region were used to phyloacc analysis without any other filters. I have run phyloacc for whole genome data successfully but in order to get more accurate results now I try to run phyloacc-GT in batch.

Here I have another question: Thanks to your prompt message, I checked my alignments and I do noticed that some alignment only have 2-3 segregating sites in ~60bp, so should I filter this kind of alignments (too few segregating sits) for both phyloacc and phyloacc-GT analysis because they are unlikely to be the accerelated element? if it makes sense, what threshold shoud I set appropriately？Filtering similar alignmnet can also reduce running time (I have 1 million CNEE! ), it will be great!

Looking forward to using the "-r adaptive" mode as soon, and thank you for your efforts!

gwct commented 3 months ago

Yes, it could definitely help to filter elements with few informative sites. phyloacc.py by default removes loci without any informative sites, but you could go beyond that and filter out other loci. Unfortunately, there isn't really a set rule for how many or what percentage of sites should be informative to yield accurate results. Any filtering could lead to false negatives. You could take a look at the aln-stats.csv file in your phyloacc.py output directory, which should tell you how many informative sites are in each alignment. This could help you get an idea for how many loci would be removed with various cutoffs for informative sites.

Beyond that, if you wanted to estimate the false negative rate from filtering loci with few informative sites, you'd have to do some filtering, then run the filtered elements anyway and see how many you infer accelerations in. This kind of defeats the purpose of the filtering, but could be useful to do on a small number of the filtered loci. Either way, this would take more time.

Hopefully that helps!

zhou-sumei commented 3 months ago

Thank you for your advise!

gwct commented 3 months ago

The fix for the initial bug is released in v2.3.1.

phyloacc / PhyloAcc

phyloacc_gt with "-r adaptive" encounter : ZeroDivisionError: division by zero #57