Closed hyunhwan-jeong closed 6 years ago
Sorry it took me so long to get to this. obs_phenotype
is only used for FACS screens so it shouldn't be a problem for a Growth Screen.
obs_phenotype
is initialized to -Inf
here
https://github.com/tlnagy/Crispulator.jl/blob/master/src/simulation/common.jl#L32
and then is updated when you call select
https://github.com/tlnagy/Crispulator.jl/blob/master/src/simulation/selection.jl#L19-L20
After looking at this, it looks like obs_phenotype
is a bit unnecessary since the "observed" phenotype is the counts of each guide in either bin (bottom vs top percentile in FACS and the beginning vs end for growth screens). There is actual no reference to obs_phenotype
outside of the two lines in the previous commit.
Also, @hyunhwaj. Sometimes running your first three blocks of code (cells), gives the following the following error:
ArgumentError: median of an empty array is undefined, Float64[]
Stacktrace:
[1] median!(::Array{Float64,1}) at ./statistics.jl:590
[2] median(::DataArrays.DataArray{Float64,1}) at /Users/tamasnagy/.julia/v0.6/DataArrays/src/operators.jl:651
[3] #differences_between_bins#80(::Symbol, ::Symbol, ::Function, ::Dict{Symbol,DataFrames.DataFrame}) at /Users/tamasnagy/.julia/v0.6/Crispulator/src/simulation/processing.jl:13
[4] differences_between_bins(::Dict{Symbol,DataFrames.DataFrame}) at /Users/tamasnagy/.julia/v0.6/Crispulator/src/simulation/processing.jl:5
[5] include_string(::String, ::String) at ./loading.jl:522
This is because you're only simulated 10 genes and sometimes there's no negative control guides (since you set their frequency at 5%).
If this happens, I recommend just rerunning till there is a negative control or increase the frequency of negative controls or increase the total number of genes.
Thanks, @tlnagy for your detailed answers regarding my question. I have further questions:
obs_phenotype
is no longer need to be in the bc_counts
anymore for both types of screens?max_phenotype_dists
is the parameter to set up proportions number of genes for each type of phenotype randomly. Am I understanding correctly? Thanks again,
Hyun-Hwan Jeong
Correct. If you use the master version of the package, you'll see that I removed the obs_phenotype
column from the DataFrame. It was an unnecessary relic. See http://tamasnagy.com/Crispulator.jl/latest/internals.html#Simulation.differences_between_bins for the meaning of each column. I just fleshed out the documentation so hopefully it'll help. Let me know if anything isn't clear still and I can fix the docs.
That is correct.
The negative control genes need to be in the simulation. They are essentially nonsense targets for a fraction of the guides (nonsense guides). I normalize the frequencies of all other guides to the median frequency of the negative control guides. This line here: https://github.com/tlnagy/Crispulator.jl/blob/b66357e3d8e834ba5dec6d771c44ef33f5d90d2b/src/simulation/processing.jl#L88-L89 I also use them as a null distribution against which to compute the statistical significance of fold change for each gene. See here: https://github.com/tlnagy/Crispulator.jl/blob/b66357e3d8e834ba5dec6d771c44ef33f5d90d2b/src/simulation/processing.jl#L110-L114 The basic idea is that negative control guides give a good idea of the noise in the system and if we compare how consistently shifted the frequencies of the guides of a certain gene are from this null distribution we can get a good idea of the significance of the shift. This approach was originally published here: http://www.pnas.org/content/110/25/E2317 and used with ultra-complex shRNA libraries
an important innovation in our strategy is the comparison of the shRNAs targeting each gene with the negative control shRNAs (Fig. 3A). The large set of negative control shRNAs provides the appropriate “null distribution” by controlling for both the measurement noise and unintended off-target effects.
but it was also applied to sgRNAs in a CRISPRi screen (https://www.sciencedirect.com/science/article/pii/S0092867414011787?via%3Dihub)
For each gene, a Mann-Whitney p value is calculated using CRISPRi/a sgRNA activity relative to a negative control distribution for 24 subsampled sgRNAs.
I hope that helps!
To simulate growth screening, I wrote below notebook:
https://gist.github.com/hyunhwaj/b941b84224d22b9b4bc112ee3b904d38
After I ran
differences_between_bins
, I am able to havebc_counts
variable, but all of the values ofobs_phenotype
are-Inf
. I don't see the issue when I simulate FACS screening. I am doubting the-Inf
values are correct, and we might have other values which are close totheo_phenotype
. Also, it is possible my code on the notebook is wrong. I will appreciate if it is true and you inform me what mistakes were in the code.Thank you,
Hyun-Hwan Jeong