smithlabcode / preseq

Software for predicting library complexity and genome coverage in high-throughput sequencing.
https://preseq.readthedocs.io
GNU General Public License v3.0

lc_extrap: too many defects, negative CI, and causes #61

Closed iamTad closed 2 years ago

iamTad commented 2 years ago

When running lc_extrap without using defect mode (all default using input bam file), I get an error for a particular sample stating:

ERROR: too many defects in the approximation, consider running in defect mode

As recommended by the manual, I've run the command in defect mode (with the -D option) and I get a negative value for the lower CI at higher total reads:

Below 416000000.0 total reads: lower CI positive
416000000.0 total reads: lower CI = 192802711.1
417000000.0 total reads: lower CI = -178785167.5
Above 417000000.0 total reads: lower CI negative

Question 1. I read the paper associated with this program (Predicting the molecular complexity of sequencing libraries) and don't quite understand what leads to having defects (or too many defects) in the approximation. My understanding from the supplement is that the frequencies (of each unique molecule?) affect this--but how?

Question 2. What could be causing this negative lower CI value (and the switch from positive to negative)?

andrewdavidsmith commented 2 years ago

@iamTad A defect is a pole and a zero of the estimated rational function that lie close enough to each other. The problem with that definition is that if a pole and zero are extremely close, they effectively cancel and never affect the values we observe as rationals representable in our computers, so the function looks smooth. The idea of "close enough" to matter but "far enough" to detect is tricky to communicate; if I could show you a plot, you'd understand right away.

It's likely that some very well-performing data sets have defects that we simply can't see in their corresponding estimated rational functions. If the defects are a problem, they lead to predictions that shoot toward positive and negative infinity nearby, and those would clearly fall outside whatever CI might have been estimated. On the other hand, if a defect interacts with a CI estimate itself, you can imagine strange things happening there too.

I think at one point we knew how to simulate data such that the estimated RFAs had lots of defects that could be clearly seen. I'll take a look for those this week.
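To make the pole/zero picture concrete, here is a small illustrative sketch (hypothetical Python, not preseq's actual code): a rational function with a zero at x = 30.001 and a pole at x = 30. Away from x = 30 it looks like a smooth curve near 50, but evaluated close to 30 it blows up, with opposite signs on either side of the pole.

```python
# Hypothetical illustration of a "defect": a rational function whose
# numerator zero (x = 30 + eps) and denominator pole (x = 30) nearly
# cancel. Far from x = 30 the curve looks smooth; evaluated very close
# to it, the values diverge with opposite signs on either side.

def rational(x, eps=1e-3):
    return 50.0 * (x - 30.0 - eps) / (x - 30.0)

for x in (10.0, 20.0, 29.9999, 30.0001, 40.0):
    print(f"x = {x:9.4f}   f(x) = {rational(x):10.3f}")
```

On a coarse grid of evaluation points the pole is easy to miss entirely, which is the detection difficulty described above.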

andrewdavidsmith commented 2 years ago

Here is an old image showing a defect in a rational function approximation. It isn't the best example, but I don't have time to make one now; I might make another if I need to illustrate the concept in more detail. I wish the red line here were thinner and not dashed so the curvature could be seen better.

You can see the defect at around x = 30M reads. Approaching that value from below, the curve diverges to -infty; approaching from above, it goes to +infty. If the red line were plotted at many more points, we would see it go all the way down and come back from the top of the image. I think we check 3 criteria at all estimated points to identify a defect:

  1. non-negative and finite
  2. increasing
  3. negative second derivative

If any of those properties is violated around any of the estimated points, the corresponding coefficients are discarded. If this happens in bootstrap mode, a new bootstrap is attempted, and we only require that some fraction of the bootstraps succeed. In your case there were too many defects, surpassing the allowed fraction. That doesn't guarantee the non-bootstrapped data would show a defect, which is why we recommend trying with -D. However, if there is still a defect after using -D, it will affect the output. In the example image, the defect might not pose a problem if you know to ignore estimates around the "30M" point, but more commonly, if there is one defect there are many.

[Screenshot (2022-09-07): rational function approximation with a defect near x = 30M reads]
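The three per-point criteria above can be sketched roughly as follows (a hypothetical Python helper, not the actual preseq implementation), using finite differences on the curve evaluated at a grid of points:

```python
# Hypothetical sketch of the three per-point defect checks: given the
# estimated curve evaluated on a grid, flag a defect if any value is
# negative or non-finite, if the curve ever fails to increase, or if
# any second difference is non-negative (i.e., not strictly concave).
import math

def has_defect(ys):
    # 1. non-negative and finite at every point
    if any((not math.isfinite(y)) or y < 0 for y in ys):
        return True
    # 2. increasing
    if any(b <= a for a, b in zip(ys, ys[1:])):
        return True
    # 3. negative second derivative (concavity), via second differences
    if any(c - 2 * b + a >= 0 for a, b, c in zip(ys, ys[1:], ys[2:])):
        return True
    return False
```

A smooth, increasing, concave curve (like a healthy complexity curve) passes; a curve that dips or shoots off toward infinity near a defect fails.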

The most common cause is data with some strange property, or maybe just a broken file format. I suggest checking the counts histogram (available using the -v verbose output). When these defects happen unexpectedly, the precise cause can be difficult to pin down (e.g., one may need to trace the calculation of the coefficients), but it usually signals bad things for that library.
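For readers unfamiliar with the counts histogram: it records, for each j, how many distinct molecules were observed exactly j times. A hypothetical sketch (assuming per-molecule observation counts have already been extracted from the BAM, which is not how preseq ingests data internally):

```python
# Hypothetical sketch: build a counts histogram from per-molecule
# observation counts. hist[j-1] = number of distinct molecules seen
# exactly j times. A healthy library typically shows a smoothly
# decaying histogram; spikes or gaps can hint at the strange
# properties mentioned above.
from collections import Counter

def counts_histogram(obs_counts):
    freq = Counter(obs_counts)  # times-seen -> number of molecules
    return [freq.get(j, 0) for j in range(1, max(freq) + 1)]

print(counts_histogram([1, 1, 2, 1, 3, 2, 1, 1]))  # → [5, 2, 1]
```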

I suspect we need more/better documentation related to the -D argument.

iamTad commented 2 years ago

I appreciate the update! This helps to clarify the questions that I initially had.