timothy-barry / ondisc

Space- and time-optimal algorithms for large single-cell expression matrices, with a focus on single-cell CRISPR screens.
https://timothy-barry.github.io/ondisc/
Other
11 stars 5 forks source link

“read_odm” function has been removed from current version ? #26

Closed L1angyan closed 7 months ago

L1angyan commented 7 months ago

Hi Tim, I downloaded the processed import-gasperini-2019-v2 data from the dropbox link you shared in sceptre2-manuscript. I want to generated a sceptre object using the code in import-gasperini-2019-v2.

However, when I ran gene_odm <- read_odm(paste0(processed_gene_dir, "matrix.odm"), paste0(processed_gene_dir, "metadata.rds"))

I got an error: Error in read_odm(paste0(processed_gene_dir, "matrix.odm"), paste0(processed_gene_dir, : could not find function "read_odm"

Is the read_odm funtion has been deleted? How I can get the sceptre object from these input ?

>tree ./gasperini-2019-v2/
gasperini-2019-v2/
└── at-scale
    └── processed
        ├── gene
        │   ├── matrix.odm
        │   └── metadata.rds
        ├── grna_assignment
        │   ├── grna_assignment.zip
        │   ├── matrix.odm
        │   └── metadata.rds
        ├── grna_expression
        │   ├── grna_expression.zip
        │   ├── matrix.odm
        │   └── metadata.rds
        └── pairs_ungrouped.rds

5 directories, 9 files

Best Yan

timothy-barry commented 7 months ago

Hi Yan,

Do you mind telling me a bit more about the analysis you are trying to run? For example, are you seeking to deploy the sceptre pipeline to analyze the Gasperini 2019 data?

The data in the Dropbox repo (which are associated with the "Robust DE testing..." preprint) are in ondisc version 1.1.0 format. You can download ondisc version 1.1.0 via install_github("timothy-barry/ondisc", ref = "v1.1"). This should allow you to access the Gasperini data stored in the Dropbox repo.

However, note that ondisc version 1.1.0 is not compatible with the sceptre Nextflow pipeline. Instead, the Nextflow pipeline requires ondisc version 1.2.0 (i.e., the current version of ondisc). I can supply you with the Gasperini data stored in ondisc version 1.2.0 if that would be helpful.

Tim

L1angyan commented 7 months ago

Hi, Tim

Thank you for the prompt reply.

Definitely, I am intersted in some enhancers in K562 cell line and seeking to deploy the sceptre pipeline to analyze the Gasperini 2019 data. The Gasperini data stored in ondisc version 1.2.0 you said are the input file mentioned in import-gasperini-2019-v3 repos ?

└── at-scale
    ├── processed
    │   ├── grna.odm
    │   ├── response.odm
    │   ├── sceptre_object.rds
    │   └── sceptre_object_in_memory.rds

If yes, Gasperini data stored in ondisc version 1.2.0 are exactly what I need. You will share me a dropbox link, right ? I am so grateful for your help !!!

Yan

timothy-barry commented 7 months ago

Hi Yan,

Got it. We were planning to do a trans analysis of the Gasperini data as part of our current project. I think I will get to this in about a week or two, probably. Perhaps I could send you the data and preliminary results at that time?

Tim

L1angyan commented 7 months ago

Hi Yan,

Got it. We were planning to do a trans analysis of the Gasperini data as part of our current project. I think I will get to this in about a week or two, probably. Perhaps I could send you the data and preliminary results at that time?

Tim

Thank you Tim. I am looking forward to your results.

Yan

L1angyan commented 7 months ago

Hi Tim, I followed the import-gasperini-2019-v3 pipeline to generate these files:

(sceptre) [liangyan@mu01 at-scale]$ tree processed/ 1.sceptre_res/
processed/
├── grna.odm
├── pairs_grouped.rds
├── response.odm
├── sceptre_object_mem.rds
└── sceptre_object.rds
1.sceptre_res/
├── 1.runSCEPTRE_pipeline.sh
├── 1.sceptre_outputs
│   ├── analysis_summary.txt
│   ├── grna_assignment_matrix.rds
│   ├── plot_assign_grnas.png
│   ├── plot_grna_count_distributions.png
│   ├── plot_run_calibration_check.png
│   ├── plot_run_discovery_analysis.png
│   ├── plot_run_power_check.png
│   ├── plot_run_qc.png
│   ├── results_run_calibration_check.rds
│   ├── results_run_discovery_analysis.rds
│   └── results_run_power_check.rds
├── 1.sceptre.R
├── 1.sgRNA_DEanalysis.csv
├── 1.sgRNA_info.csv
├── 302669.err
└── 302669.out

Interestingly, I found that only on chromosome 9, there are some siginicant enhancer-gene pairs which is whose distance is much greater than 500kb. For example, the UBAC1 gene located far away form the two enhancer. It seems strange, isn't it? image

However, the GSE120861_gene_gRNAgroup_pair_table.at_scale.txt do not inclue the enhancer-gene pairs.

timothy-barry commented 7 months ago

Hi Lyon,

Could you provide some details about the analysis that you ran? For example, did you run a cis analysis or trans analysis (I'm guessing the former)? Were you using the ondisc-backed sceptre_object or the standard sceptre_object? And did you use the R console interface or the Nextflow pipeline?

I am not sure about the enhancer located at ~chr9:135879214, but my suspicion is that this enhancer regulates a nearby transcription factor, which in turn regulates UBAC1.

By the way, I plan to run a trans analysis on the Gasperini data today. Perhaps we could discuss more when I obtain those results.

L1angyan commented 7 months ago

I used the ondisc-backed sceptre_object just in the R console interface. Actually, I run an cis-analysis, thus I do not think for enhancer located at ~chr9:135879214, the UBAC1 gene should be tested.

By the way, the gene-sgRNA pairs table GSE120861_gene_gRNAgroup_pair_table.at_scale.txt do not include the ~chr9:135879214 enhancer-UBAC1 pairs.

This is the the code I run, following get started with sceptre.

library(sceptre)
library(dplyr)
library(conflicted)
conflict_prefer("filter", "dplyr")

gasp_offsite <- paste0("/public/home/liunangroup/liangyan/project/tf-assymmetry/20240415_CreType-diff/CRISPR_screen/import-gasperini-2019-v3/",
                       "at-scale/")
processed_data_dir <- paste0(gasp_offsite, "processed/")

sceptre_object <- read_ondisc_backed_sceptre_object(
  sceptre_object_fp = paste0(processed_data_dir, "sceptre_object.rds"),
  response_odm_file_fp = paste0(processed_data_dir, "response.odm"),
  grna_odm_file_fp = paste0(processed_data_dir, "grna.odm"))
sceptre_object

write.csv(sceptre_object@grna_target_data_frame,'1.sgRNA_info.csv',row.names=F,col.names=T)

positive_control_pairs <- construct_positive_control_pairs(sceptre_object)
discovery_pairs <- construct_cis_pairs(
  sceptre_object, 
  positive_control_pairs = positive_control_pairs)

# apply the pipeline functions to the sceptre_object in order
sceptre_object <- sceptre_object |> # |> is R's base pipe, similar to %>%
  set_analysis_parameters(discovery_pairs, positive_control_pairs,resampling_mechanism="permutations") |>
  run_calibration_check(parallel = TRUE) |>
  run_power_check(parallel = TRUE) |>
  run_discovery_analysis(parallel = TRUE)

# write the results to disk
write_outputs_to_directory(sceptre_object, "1.sceptre_outputs")
thread='32'
sbatch --job-name=job --partition=cu --nodes=1 --ntasks-per-node=$thread --output=%j.out --error=%j.err --wrap="
/public/home/liunangroup/liangyan/software/miniconda3/envs/sceptre/bin/Rscript 1.sceptre.R
"
L1angyan commented 7 months ago

I guess I find the reason. sceptre using coordinate of gene position in hg38 reference genome. However, the results of gasperini-2019 based on hg19 genome.

Actually, the TSS of UBAC1 in hg38 is Chromosome 9: 135,932,969-135,961,373, locating at the cis-region of ~chr9:135879214

?construct_cis_pairs

#Usage:
#
#     construct_cis_pairs(
#       sceptre_object,
#       positive_control_pairs = data.frame(),
#       distance_threshold = 500000L,
#       response_position_data_frame = gene_position_data_frame_grch38)
timothy-barry commented 7 months ago

Ah, I see. The sceptre high-MOI example uses gene_position_data_frame_grch38 in construct_cis_pairs(), but really it should be using gene_position_data_frame_grch19. 😳 I will add gene_position_data_frame_grch19 to the package and fix the example. Thank you for noticing this!

By the way, I ran a trans analysis of the Gasperini data using the sceptre pipeline. Note that the trans analysis pairs all gRNAs to all genes; thus, the issue with the construct_cis_pairs() function is not relevant here. You can download the results here: https://www.dropbox.com/scl/fo/q2sasfxtxzprsfazyjbdm/AOjk2GDhhyjzdZcdpk2QuTs?rlkey=2drsij7e1l2wytaaqfxg0ro6h&st=5jvgi08f&dl=0. Note that these results are preliminary, and we will officially release these results as part of our upcoming manuscript. The book provides instructions for interacting with the .parquet files.

timothy-barry commented 7 months ago

Oh, I should add that I randomly grouped the NT gRNAs into groups of size two for this analysis. I then tested each NT gRNA group against the entire set of genes. The NT gRNA group names begin with "nt_".

L1angyan commented 7 months ago

Hi Tim, Thank you for your help. Looking forward to your new work.

By the way, I guess the purpose of grouping NT sgRNAs is acquiring some baseline effect of negtive control for hypothesis testing, so I do not need to pay attention to the groups begin with "nt_", right?

timothy-barry commented 7 months ago

Yes, that is right, you can ignore the gRNA targets that begin with "nt". (However, if you want to convince yourself that sceptre is doing the right thing on these data --- i.e., controlling type-I error --- then you may want to look at the "nt" gRNA targets).

timothy-barry commented 7 months ago

(Note that sceptre does not use the NT gRNAs to test hypotheses or compute p-values; rather, sceptre uses the NT gRNAs to verify control of type-I error/FDR.)