phyloacc / PhyloAcc

PhyloAcc a software to detect the changes of conservation of a genomic region
GNU General Public License v3.0
26 stars 12 forks source link

phyloacc_post.py getting hung up on inf and -inf values #54

Closed AlexWeitzel closed 3 months ago

AlexWeitzel commented 9 months ago

Hello,

When calling phyloacc_post.py I receive the following error.

Screen Shot 2023-09-14 at 11 37 12 AM

This appears to be due to 22 out of ~8,000 loci having -inf values for marginal.likelihood.m0, and inf values for logbf1 and logbf3

Screen Shot 2023-09-14 at 11 50 08 AM

I was able to get around this by adding filtered_bf1s = [v for v in bf1s if not math.isinf(v)] and then plotting filtered_bf1s instead of bf1s (and doing the same for bf3s) in phyloacc_lib/plot.py, but I was wondering if you knew why the inf values were occurring in the first place.

Thank you, Alex

gwct commented 9 months ago

Hey there, Thanks for pointing this out and providing the line of code that filters the infs. I'm not sure why you'd be getting them in the output to begin with. Like you said, they are all -inf for the M0 marginal likelihood column, and that causes BFs 1 and 3 to then be inf.

One thing to check would be the original elem_lik.txt for a given batch to see what the M0 marginal likelihood is there. For instance, the first line in your second screenshot has an id of 182-5. If you went into the phyloacc-job-files/phyloacc-output/182-phyloacc-*-out/ directory and looked at the 182_elem_lik.txt file, what is the M0 entry here for element id 5? Note that this file has a slightly different format than the final elem_lik file, so the M0 likelihood is the column labeled "loglik_Null_W". If this is not also inf, then something is going wrong with the phyloacc_post.py script. Otherwise, we'd have to ask @xyz111131 or @HanY-H why this model is producing an inf.

AlexWeitzel commented 9 months ago

Hello,

Thanks for your response. The loglik_Null is -inf the corresponding logBF1 is inf in the 182_elem_lik.txt file.

Screen Shot 2023-09-18 at 10 57 08 AM

Hope this helps, Alex

gwct commented 9 months ago

Ok, so its likely that those elements are just failing to converge. I'm not sure why this happens for a particular element, but I think it is expected to happen sometimes, resulting in the -infs in the model fits.

But the phyloacc_post.py script should definitely handle the infs better by reporting how many there are and excluding them from the list when plotting, which seems to break matplotlib. I will keep this issue open as we address that.

AlexWeitzel commented 8 months ago

Hello, just an update. I just encountered a scenario where the logbf2 was also inf. So it appears logbf1, logbf2, and logbf3 can all give inf values. I just wanted to amend my previous post where I had only seen logbf1 and logbf3 as inf. The same fix by filtering out inf values applies.

I have also taken a look at some of the fasta files that have generated these errors, and by eye they seem to contain significant gapped regions if that gives any clue as to how this is occurring.

Thanks, Alex

gwct commented 8 months ago

Ok, thanks for the update. I've got a list of bug fixes that I'll be implementing soon, so I'll be sure to include this.

gwct commented 3 months ago

This has been addressed in v2.3.1.