sanger-pathogens / Bio-Tradis

A set of tools to analyse the output from TraDIS analyses
https://sanger-pathogens.github.io/Bio-Tradis/
Other
22 stars 29 forks source link

Essentiality analysis code #117

Closed apredeus closed 4 years ago

apredeus commented 4 years ago

Hello,

I was wondering about the piece of code in the beginning of tradis_essentiality.R:

h <- hist(ii, breaks=200,plot=FALSE)
maxindex <- which.max(h$density[10:length(h$density)])
maxval <- h$mids[maxindex+3]

I understand why do we skip first 9 bins (even though in my distribution it's too much) - but why do we offset it by 3 when looking for max value? Seems insignificant, but I've tried changing these numbers, and the number of genes deemed essential is changing by 100 sometimes (700 vs 800 - the library is not very dense, about 100k unique insertions).

Thank you

-- Alex

lbarquist commented 4 years ago

Hi Alex,

It's just selecting a point after the second maxima -- the distributions are then separated using loess to find a minima between the two maxima for the subsequent curve fitting. It's important to note that the algorithm is heuristic, and may fail if your data doesn't look like what we've built it for, though it's been used on a fairly diverse range of data sets.

It's hard to say without seeing your data, but I suspect the problem is that you probably don't have sufficient density in your library for essentiality analysis using this script -- I'm guessing non-essential genes aren't saturated. This would be clear in the plots the script outputs; if you don't see two clear peaks in the distribution, that's the reason. In this case your best bet might be to just select all genes with no insertions as your candidate "essential" set.

-Lars

apredeus commented 4 years ago

Hello Lars,

thank you for your reply. Yes, the libraries are somewhat sparse (100-150k unique insertions per condition), however, the difference between the two strains we're using would make the approach hard (one strain has ~ 300 zero-insertion genes and another about ~ 600).

However, I was making a more immediate point of number discrepancy - 10 vs 3 in the code I've listed. It seems like the purpose of it is to drop several high histogram bins on the left. But if you're dropping first 9 bins, shouldn't the maxval be h$mids[maxindex+9]? And why drop first 9 bins out of 200?

Below is the distribution I observe in a typical experiment. Left arrow points to where current Bio-Tradis code identifies the maximum to be. Right one is what you get when you change 10 to 4. This changes the number of genes used in Loess smoothing by ~ 1000, and thus influences the exponential and gamma fits quite dramatically.

image

lbarquist commented 4 years ago

Hi Alex,

I'm not sure I can give you a satisfactory answer for "why" for some of these values, besides that they've worked for us and our data. The algorithm is heuristic and it will not be applicable to every data set, this is why we output diagnostic plots to check it. It's also worth pointing out that this kind of analysis only does a rough classification of potentially essential genes, I don't think it's going to solve the problem of insufficient density in one data set.

The purpose of dropping the first 9 bins is just to get rid of the initial essential peak. I am not sure at this point why the second value was set to maxval + 3 rather than maxval + 9. The only real point of this is to get past the local minimum between the two peaks for the loess fitting.

You're welcome to modify these values if they don't work for your data, as it sounds like you've done. If you have a clear case where the method is failing badly but you think it should work, you can email me directly with the details and data and I'd be happy to take a look.

-Lars

apredeus commented 4 years ago

Thank you for the explanation!