"-Inf" values of obs_phenotype

hyunhwan-jeong commented 6 years ago

To simulate growth screening, I wrote below notebook:

https://gist.github.com/hyunhwaj/b941b84224d22b9b4bc112ee3b904d38

After I ran differences_between_bins, I am able to have bc_counts variable, but all of the values of obs_phenotype are -Inf. I don't see the issue when I simulate FACS screening. I am doubting the -Inf values are correct, and we might have other values which are close to theo_phenotype. Also, it is possible my code on the notebook is wrong. I will appreciate if it is true and you inform me what mistakes were in the code.

Thank you,

Hyun-Hwan Jeong

tlnagy commented 6 years ago

Sorry it took me so long to get to this. obs_phenotype is only used for FACS screens so it shouldn't be a problem for a Growth Screen.

obs_phenotype is initialized to -Inf here

https://github.com/tlnagy/Crispulator.jl/blob/master/src/simulation/common.jl#L32

and then is updated when you call select

https://github.com/tlnagy/Crispulator.jl/blob/master/src/simulation/selection.jl#L19-L20

tlnagy commented 6 years ago

After looking at this, it looks like obs_phenotype is a bit unnecessary since the "observed" phenotype is the counts of each guide in either bin (bottom vs top percentile in FACS and the beginning vs end for growth screens). There is actual no reference to obs_phenotype outside of the two lines in the previous commit.

tlnagy commented 6 years ago

Also, @hyunhwaj. Sometimes running your first three blocks of code (cells), gives the following the following error:

ArgumentError: median of an empty array is undefined, Float64[]

Stacktrace:
 [1] median!(::Array{Float64,1}) at ./statistics.jl:590
 [2] median(::DataArrays.DataArray{Float64,1}) at /Users/tamasnagy/.julia/v0.6/DataArrays/src/operators.jl:651
 [3] #differences_between_bins#80(::Symbol, ::Symbol, ::Function, ::Dict{Symbol,DataFrames.DataFrame}) at /Users/tamasnagy/.julia/v0.6/Crispulator/src/simulation/processing.jl:13
 [4] differences_between_bins(::Dict{Symbol,DataFrames.DataFrame}) at /Users/tamasnagy/.julia/v0.6/Crispulator/src/simulation/processing.jl:5
 [5] include_string(::String, ::String) at ./loading.jl:522

This is because you're only simulated 10 genes and sometimes there's no negative control guides (since you set their frequency at 5%).

If this happens, I recommend just rerunning till there is a negative control or increase the frequency of negative controls or increase the total number of genes.

hyunhwan-jeong commented 6 years ago

Thanks, @tlnagy for your detailed answers regarding my question. I have further questions:

Did you mean that obs_phenotype is no longer need to be in the bc_counts anymore for both types of screens?
I want to make sure max_phenotype_dists is the parameter to set up proportions number of genes for each type of phenotype randomly. Am I understanding correctly?
Does any negative control genes has to be in the simulation? If it is, then could you briefly explain any reason why it has to be in the simulation?

Thanks again,

Hyun-Hwan Jeong

tlnagy commented 6 years ago

Correct. If you use the master version of the package, you'll see that I removed the obs_phenotype column from the DataFrame. It was an unnecessary relic. See http://tamasnagy.com/Crispulator.jl/latest/internals.html#Simulation.differences_between_bins for the meaning of each column. I just fleshed out the documentation so hopefully it'll help. Let me know if anything isn't clear still and I can fix the docs.
That is correct.
The negative control genes need to be in the simulation. They are essentially nonsense targets for a fraction of the guides (nonsense guides). I normalize the frequencies of all other guides to the median frequency of the negative control guides. This line here: https://github.com/tlnagy/Crispulator.jl/blob/b66357e3d8e834ba5dec6d771c44ef33f5d90d2b/src/simulation/processing.jl#L88-L89 I also use them as a null distribution against which to compute the statistical significance of fold change for each gene. See here: https://github.com/tlnagy/Crispulator.jl/blob/b66357e3d8e834ba5dec6d771c44ef33f5d90d2b/src/simulation/processing.jl#L110-L114 The basic idea is that negative control guides give a good idea of the noise in the system and if we compare how consistently shifted the frequencies of the guides of a certain gene are from this null distribution we can get a good idea of the significance of the shift. This approach was originally published here: http://www.pnas.org/content/110/25/E2317 and used with ultra-complex shRNA libraries

an important innovation in our strategy is the comparison of the shRNAs targeting each gene with the negative control shRNAs (Fig. 3A). The large set of negative control shRNAs provides the appropriate “null distribution” by controlling for both the measurement noise and unintended off-target effects.

but it was also applied to sgRNAs in a CRISPRi screen (https://www.sciencedirect.com/science/article/pii/S0092867414011787?via%3Dihub)

For each gene, a Mann-Whitney p value is calculated using CRISPRi/a sgRNA activity relative to a negative control distribution for 24 subsampled sgRNAs.

I hope that helps!

tlnagy / Crispulator.jl

"-Inf" values of obs_phenotype #52