Tim Barry
sceptre
Nextflow pipelineThis repository contains the sceptre
Nextflow pipeline. The sceptre
Nextflow pipeline is a command line utility that facilitates running
sceptre
(i) out-of-core on a laptop or desktop or (ii) in a
distributed fashion on a cluster or cloud. The sceptre
Nextflow
pipeline is highly scalable and memory-efficient; therefore, we
recommend using the sceptre
Nextflow pipeline instead of the R
interface to sceptre
in most cases.
sceptre
and ondisc
packages:# in R
install.packages("devtools")
setRepositories(ind = 1:4)
devtools::install_github("katsevich-lab/sceptre")
devtools::install_github("timothy-barry/ondisc")
sceptre
Nextflow pipeline.# on command line
nextflow pull timothy-barry/sceptre-pipeline
We describe the sceptre
pipeline arguments here. As an example we use
data from the
paper
“A genome-wide framework for mapping gene regulation via cellular
genetic screens” by Gasperini et al., 2019.
Four input files are required: (i) the multimodal ondisc matrix metadata
file (multimodal_metadata_fp
), (ii) the backing .odm file of the gene
ondisc matrix (gene_odm_fp
), (iii) the backing .odm file of the gRNA
ondisc matrix (grna_odm_fp
), and (iv) a data frame containing the set
of gene-gRNA group pairs to analyze (pair_fp
). On my (Tim’s) machine
these files are located in the following places:
# in R
gasp_offsite_dir <- paste0(.get_config_path("LOCAL_SCEPTRE2_DATA_DIR"), "data/gasperini/at_scale/")
multimodal_metadata_fp <- paste0(gasp_offsite_dir, "multimodal_metadata.rds")
gene_odm_fp <- paste0(gasp_offsite_dir, "gene/matrix.odm")
grna_odm_fp <- paste0(gasp_offsite_dir, "grna_expression/matrix.odm")
pair_fp <- paste0(gasp_offsite_dir, "neg_control_pairs.rds")
The multimodal ondisc matrix should satisfy the following conditions.
# in R
library(ondisc)
crispr_experiment <- read_multimodal_odm(odm_fps = c(gene_odm_fp, grna_odm_fp),
multimodal_metadata_fp = multimodal_metadata_fp)
gene_modality <- get_modality(crispr_experiment, "gene")
grna_modality <- get_modality(crispr_experiment, "grna_expression")
grna_group
indicating the “group”
to which each gRNA belongs. Typically, targeting gRNAs are grouped
according to the site that they target, and non-targeting gRNAs are
grouped randomly into sets of size two or three.# in R
grna_modality
#> A covariate_ondisc_matrix with the following components:
#> An integer-valued ondisc_matrix with 13182 features and 206271 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis.
#> A feature covariate matrix with columns mean_expression, n_nonzero, grna_group, target_type, target, target_gene.
grna_modality |>
get_feature_covariates() |>
head()
#> mean_expression n_nonzero grna_group target_type
#> AAACCGCTCCCGAGCACGGG 0.08524339 1450 SH3BGRL3_TSS gene_tss
#> AAATAGTGGGAAGATTCGTG 0.02863151 556 MTRNR2L8_TSS gene_tss
#> AACACACCACGGAGGAGTGG 0.06421350 987 FAM83A_TSS gene_tss
#> AACAGCCCGGCCGGCCAAGG 0.07196465 1192 ZNF593_TSS gene_tss
#> AACGAGAGACTGCTTGCTGG 0.03283749 693 ATPIF1_TSS gene_tss
#> AACGGCTCGGAAGCCTAGGG 0.07587158 1251 TIPRL_TSS gene_tss
#> target target_gene
#> AAACCGCTCCCGAGCACGGG chr1:26605667-26605668 ENSG00000142669
#> AAATAGTGGGAAGATTCGTG chr11:10530735-10530736 ENSG00000255823
#> AACACACCACGGAGGAGTGG chr8:124191287-124191288 ENSG00000147689
#> AACAGCCCGGCCGGCCAAGG chr1:26496362-26496363 ENSG00000142684
#> AACGAGAGACTGCTTGCTGG chr1:28562620-28562621 ENSG00000130770
#> AACGGCTCGGAAGCCTAGGG chr1:168148171-168148172 ENSG00000143155
gene_modality
#> A covariate_ondisc_matrix with the following components:
#> An integer-valued ondisc_matrix with 12129 features and 206271 cells.
#> A cell covariate matrix with columns n_nonzero, n_umis, p_mito, batch.
#> A feature covariate matrix with columns mean_expression, coef_of_variation, n_nonzero.
Next, the data frame containing the pairs to analyze should contain
columns gene_id
and grna_group
. Additional columns are permitted but
ignored.
# in R
pairs_df <- readRDS(pair_fp)
head(pairs_df)
#> gene_id grna_group
#> 1 ENSG00000238009 random_24
#> 2 ENSG00000237683 random_24
#> 3 ENSG00000228463 random_24
#> 4 ENSG00000237094 random_24
#> 5 ENSG00000235373 random_24
#> 6 ENSG00000228327 random_24
The gene IDs within the gene_id
column should be a subset of the
feature IDs of the gene modality; meanwhile, the grna groups within the
grna_group
column should be a subset of the entries of the
grna_group
column of the feature covariate matrix of the gRNA
modality.
# in R
all(pairs_df$gene_id %in% get_feature_ids(gene_modality)) # gene ID check
#> [1] TRUE
all(pairs_df$grna_group %in% get_feature_covariates(grna_modality)$grna_group) # gRNA group check
#> [1] TRUE
The output file path (result_fp
) specifies the location to write the
results.
The sceptre
pipeline accepts several additional arguments, ordered
here from most important to least important.
formula
: an R formula (stored as a string) specifying the
covariates to adjust for in the analysis. The variables in the
string are assumed to be columns of the global cell covariate
matrix. An example formula is as follows:# in bash
formula="~gene_p_mito+gene_batch+log(gene_n_nonzero)+log(gene_n_umis)+log(grna_n_nonzero)+log(grna_n_umis)"
The variables in this formula (gene_p_mito
, gene_batch
,
gene_n_nonzero
, gene_n_umis
, grna_n_nonzero
, and grna_n_umis
)
are columns of the global cell covariate matrix:
# in R
crispr_experiment |> get_cell_covariates() |> head()
#> gene_n_nonzero gene_n_umis
#> AAACCTGAGAGGTACC-1_1A_1_SI-GA-E2 3549 17566
#> AAACCTGAGTCAATAG-1_1A_1_SI-GA-E2 2543 8917
#> AAACCTGCAAACAACA-1_1A_1_SI-GA-E2 3191 14626
#> AAACCTGCACTTCTGC-1_1A_1_SI-GA-E2 4539 22783
#> AAACCTGCATGTAGTC-1_1A_1_SI-GA-E2 2605 10124
#> AAACCTGGTAGCGCAA-1_1A_1_SI-GA-E2 2187 9743
#> grna_assignment_n_nonzero
#> AAACCTGAGAGGTACC-1_1A_1_SI-GA-E2 68
#> AAACCTGAGTCAATAG-1_1A_1_SI-GA-E2 31
#> AAACCTGCAAACAACA-1_1A_1_SI-GA-E2 63
#> AAACCTGCACTTCTGC-1_1A_1_SI-GA-E2 41
#> AAACCTGCATGTAGTC-1_1A_1_SI-GA-E2 38
#> AAACCTGGTAGCGCAA-1_1A_1_SI-GA-E2 58
#> grna_expression_n_nonzero
#> AAACCTGAGAGGTACC-1_1A_1_SI-GA-E2 84
#> AAACCTGAGTCAATAG-1_1A_1_SI-GA-E2 44
#> AAACCTGCAAACAACA-1_1A_1_SI-GA-E2 82
#> AAACCTGCACTTCTGC-1_1A_1_SI-GA-E2 56
#> AAACCTGCATGTAGTC-1_1A_1_SI-GA-E2 71
#> AAACCTGGTAGCGCAA-1_1A_1_SI-GA-E2 79
#> grna_expression_n_umis batch
#> AAACCTGAGAGGTACC-1_1A_1_SI-GA-E2 994 prep_batch_1
#> AAACCTGAGTCAATAG-1_1A_1_SI-GA-E2 347 prep_batch_1
#> AAACCTGCAAACAACA-1_1A_1_SI-GA-E2 930 prep_batch_1
#> AAACCTGCACTTCTGC-1_1A_1_SI-GA-E2 579 prep_batch_1
#> AAACCTGCATGTAGTC-1_1A_1_SI-GA-E2 1098 prep_batch_1
#> AAACCTGGTAGCGCAA-1_1A_1_SI-GA-E2 1276 prep_batch_1
#> p_mito
#> AAACCTGAGAGGTACC-1_1A_1_SI-GA-E2 0.058786706
#> AAACCTGAGTCAATAG-1_1A_1_SI-GA-E2 0.036086518
#> AAACCTGCAAACAACA-1_1A_1_SI-GA-E2 0.069823051
#> AAACCTGCACTTCTGC-1_1A_1_SI-GA-E2 0.026186508
#> AAACCTGCATGTAGTC-1_1A_1_SI-GA-E2 0.007991318
#> AAACCTGGTAGCGCAA-1_1A_1_SI-GA-E2 0.022356681
The default behavior is to adjust for all (untransformed) variables stored within the global cell covariate matrix.
Note: The formula string should contain no white spaces (e.g., spaces, tabs, etc.).
gene_modality_name
: the name of the gene modality within the
multimodal ondisc matrix (“gene”, in the example above)
grna_modality_name
: the name of the gRNA modality within the
multimodal ondisc matrix (“grna_expression”, in the example above)
threshold
: the threshold to use to assign gRNAs to cells. For a
given cell and gRNA, if the UMI count of the gRNA within the cell is
equal to or greater than the threshold, then the gRNA is taken to be
present in the cell. The default is 3.
B
: the number of resamples to draw for the conditional
randomization test. The default is 1000.
side
: the sidedness of the test, one of “left”, “right”, or
“both”. The default is “both”.
gene_pod_size
, grna_group_pod_size
, and pair_pod_size
:
parameters that control the amount of parallelization. At a high
level the pipeline works as follows: first, all genes are regressed
onto the technical factors; next, all gRNA groups are regressed onto
the technical factors; finally, all pairs of genes and grna groups
(as specified in the pairs data frame) are tested for association.
gene_pod_size
(resp., grna_group_pod_size
) is the number of
genes (resp., gRNA groups) to regress onto the technical factors in
a given Nextflow process. Meanwhile, pair_pod_size
is the number
of gene-gRNA group pairs to test for association in a given Nextflow
process. The default values are 200, 200, and 5000, respectively.
n_pairs_to_sample
: the number of randomly-selected gene-gRNA group
pairs on which to run sceptre
. This parameter is for debugging and
testing purposes; often, it is useful to run the pipeline on (say)
25 randomly-selected pairs to ensure that the pipeline is set up
correctly. The default is to run the pipeline on the entire set of
pairs, i.e., to not subsample at all.
full_output
: “true” or “false” (default “false”); output the
resampled test statistics alongside the p-value, z-score, and log
fold change?
inference_method
: (EXPERIMENTAL) “crt” or “gcm” (default
“crt”); perform inference using the vanilla conditional
randomization test (“crt”) or the faster (but possibly less
accurate) generalized covariance measure (“gcm”)?
An example invocation script is available in the sceptre
Nextflow
repository. Git clone the repository.
# on command line
git clone git@github.com:timothy-barry/sceptre-pipeline.git
cd sceptre-pipeline
The bash script example_launch.sh
contains an example invocation of
the pipeline. We can launch this script via a call to bash
on a
laptop/desktop, qsub
on a cluster running a Sun Gride Engine
scheduler, sbatch
on a cluster running a SLURM scheduler, etc.
# on command line
bash example_launch.sh # laptop
qsub example_launch.sh # sun grid engine
sbatch example_launch.sh # slurm
The results data frame contains columns gene_id
, grna_group
,
p_value
, z_value
, and log_fold_change
. The results can be analyzed
using the tips outlined in Step 7 (“Analyze the Results”) of the
in-memory sceptre
tutorial.