sanger-pathogens / Bio-Tradis

A set of tools to analyse the output from TraDIS analyses
https://sanger-pathogens.github.io/Bio-Tradis/

How to deal with biological replicates - tradis_essentiality.R #100

Closed subwaystation closed 5 years ago

subwaystation commented 5 years ago

Hi @lbarquist ,

I am wondering how to deal with biological replicates (e.g. 3) when calculating essential genes with tradis_essentiality.R.

Do you just take the median of the insertion counts across the 3 replicates, calculate the insertion index once, and then plug that into the script? Or is there a way to calculate an interval range? Any other ideas? Thanks!

Best, Simon

lbarquist commented 5 years ago

Hi Simon,

This isn't something I've really done before, since most projects I've worked on have just had a single input library.

I think the median approach you suggest should work? If you wanted to get really fancy, you could probably put errors/deviations on these medians (median absolute deviations across the replicates is one idea?) and use these in some kind of significance calculation. Whether or not this is overkill probably depends on how certain you need to be in your results downstream - from my experience looking at essential genes from TraDIS experiments in different bacteria, I suspect this might take a lot of work to get to more or less the same result in the end, though I'd be quite happy to be proved wrong.
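A rough sketch of the median/MAD idea, in case it helps. This is not part of Bio-Tradis: the function names and the input layout (per-gene insertion counts from each replicate plus gene lengths) are made up for illustration, and the insertion index is assumed here to be insertion sites divided by gene length.

```python
from statistics import median


def median_and_mad(values):
    """Return (median, median absolute deviation) of a sequence of numbers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad


def summarise_replicates(ins_counts, gene_lengths):
    """Summarise per-gene insertion indices across replicates.

    ins_counts:   {gene: [count_rep1, count_rep2, ...]}  (hypothetical layout)
    gene_lengths: {gene: length_in_bp}
    Returns {gene: (median_index, mad_index)}.
    """
    summary = {}
    for gene, counts in ins_counts.items():
        # Assumed definition: insertion index = insertion sites / gene length.
        indices = [c / gene_lengths[gene] for c in counts]
        summary[gene] = median_and_mad(indices)
    return summary
```

The median indices could then be fed to the essentiality script as a single pseudo-library, with the MAD kept alongside as a per-gene measure of how much the replicates disagree. How to turn that into a significance test is left open here.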

Of course the easiest thing if you just want a quick result would just be to run the script three times on your three libraries, check that they agree, and maybe exclude things that are borderline or switch categories in different experiments.
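Something along these lines would do for the "check that they agree" step. It is only a sketch: the category labels ("essential", "ambiguous", "non-essential") and the dict-per-run input are assumptions for illustration, not the script's actual output format.

```python
def consensus_calls(calls_per_replicate):
    """Split genes into consistent and discordant calls across runs.

    calls_per_replicate: list of {gene: category} dicts, one per run
    (hypothetical layout). A gene is 'consistent' only if every run
    gives the same category and that category is not "ambiguous".
    """
    genes = set().union(*calls_per_replicate)
    consistent, discordant = {}, {}
    for gene in sorted(genes):
        # Genes absent from a run are recorded as "missing" for that run.
        calls = [rep.get(gene, "missing") for rep in calls_per_replicate]
        if len(set(calls)) == 1 and calls[0] != "ambiguous":
            consistent[gene] = calls[0]
        else:
            discordant[gene] = calls
    return consistent, discordant
```

Genes that end up in the discordant set are the ones to look at by hand or exclude from the final list.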

Feel free to email me directly if you want any advice on specific data, or just a second set of eyes to sanity-check.

-Lars

subwaystation commented 5 years ago

Hi Lars,

If you have only worked with a single input library, I am wondering where you got the statistical power for tradis_comparison.R. I would expect edgeR to require at least 2 replicates per group, right? I have 3 independent biological replicates per group.

Well, the median approach was the best one I could think of. So in addition to the median I would compute the MAD across replicates; did I understand that right? And how would I get a significance from that? If you could elaborate on that (or email me personally) I would be very happy.

I am not quite sure whether running the script 3 times in a row would be statistically sound. Of course we could do as you suggest, but then we would not be able to take any technical or human bias into account, and I would expect each replicate to behave differently. Let's say I have 0 | 1 | 25 as insertion counts for one gene across the three replicates. The median would be 1, but would that be close to the biological truth? If your experience tells you so, I would be happy to go that way. However, how would I justify that in a paper? Do you think that would fly?
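To put numbers on that example, here is a quick check using only Python's standard library (the variable names are just for illustration):

```python
from statistics import median

# Insertion counts for one gene in three replicates (the example above).
counts = [0, 1, 25]

med = median(counts)
# Median absolute deviation: median of |count - median|.
mad = median(abs(c - med) for c in counts)
# Full range, which unlike the median and MAD is sensitive to the outlier.
spread = max(counts) - min(counts)

print(med, mad, spread)  # 1 1 25
```

So both the median and the MAD come out as 1, and only the range shows that one replicate had 25 insertions.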

I have to ask whether I can forward you any data; then I will get back to you! Thanks for the offer.

Best, Simon

lbarquist commented 5 years ago

Hi Simon,

Generally the way we (and I think most labs, at least for in vitro work where population bottlenecks are not an issue) do these sorts of experiments is to generate a single dense base transposon mutant library, then grow this up in independent replicates and compare these. Examples of comparisons might be something like the input culture and bacteria recovered from an organ for an infection model, or growth in rich media over some time period with and without an antibiotic. So the replication is done at the level of the experiment, and not at the level of generating the mutant library.

As you've found, tradis_essentiality.R only deals with the single base library. If you've created multiple independent dense transposon mutant libraries, it's probably possible to create something more robust than tradis_essentiality.R for essentiality calling. This is not something I've done, and I'm not aware of anyone who has. The question is really whether it makes sense to spend the time to do this, and this depends on the scope of the statement you want to make based on this data. If all you want is to be able to say "we found X genes were essential, and Y% are shared with E. coli" (or similar), just running the analysis 3 times and verifying it is consistent is probably sufficient. If you're planning to write an entire paper just about the gene essentiality, then it might be worth spending the time thinking about different ways of approaching this data.

Anyway, I think this is getting off-topic for an issue report -- please close the issue & email me directly if you want to discuss this further.

subwaystation commented 5 years ago

Thanks! Sure, I will get back to you.