Open momur opened 3 months ago
Hi @momur ,
Thank you for trying celloracle.
In short, factors_indirect and factors_direct differ in their information source.
The factors_direct column is based on experimentally confirmed motifs, while the factors_indirect column is picked up from more indirect evidence or computational inference.
For more information, please see the explanations in the motif database documentation: http://cisbp.ccbr.utoronto.ca/faq.html and https://gimmemotifs.readthedocs.io/en/master/index.html
The binding site is shown in the seqname or peak_id column. Some of the elements are distal, and some are proximal.
As you pointed out, the gene_short_name is an annotation of the peak_id. For example, peak "chr10_100009210_100010306" is a cis-regulatory element of the gene DNMBP.
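To illustrate how the peak-to-gene annotation can be read, here is a minimal sketch on toy data. It assumes the base GRN is a wide table with `peak_id`, `gene_short_name`, and one 0/1 column per TF; that layout, the helper name `tfs_for_gene`, the TF values, and every peak ID except the DNMBP one are assumptions made for illustration and are not taken from your files.

```python
import pandas as pd

# Toy stand-in for a base GRN table (assumed layout: one 0/1 column per TF).
# Only the first peak_id comes from this thread; the others are hypothetical.
base_grn = pd.DataFrame({
    "peak_id": ["chr10_100009210_100010306", "chr5_0_100", "chr5_200_300"],
    "gene_short_name": ["DNMBP", "EBF1", "EBF1"],
    "ATF2": [0, 1, 0],
    "CREB1": [1, 1, 0],
    "SPI1": [0, 0, 1],
})

def tfs_for_gene(grn, gene):
    """Return TFs with a motif hit in any peak annotated to `gene`."""
    # Keep only the peaks annotated as regulatory elements of this gene.
    rows = grn[grn["gene_short_name"] == gene]
    # Every column other than the two annotation columns is treated as a TF.
    tf_cols = [c for c in grn.columns if c not in ("peak_id", "gene_short_name")]
    # A TF counts if at least one of the gene's peaks has a hit for it.
    hits = rows[tf_cols].sum(axis=0)
    return sorted(hits[hits > 0].index)

print(tfs_for_gene(base_grn, "DNMBP"))
print(tfs_for_gene(base_grn, "EBF1"))
```

Read this way, a TF listed for a gene means its motif was found in at least one peak annotated to that gene, not necessarily in the promoter.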
Hi,
Thank you for your reply. It helps.
My motivation here is to know which motifs are found in the promoter and enhancer regions. We provide co-accessible peaks, and celloracle performs TF motif scanning in the co-accessible sites. TF info is created after the TF motif scanning step. I thought that direct factors are the motifs found in the co-accessible sites. Is there a way to retrieve the information that I am looking for from any celloracle outputs?
Thanks!
@momur
You can distinguish promoter peaks and other distal regulatory element peaks as follows.
In the peak data preprocessing step, peak annotation was already done. https://morris-lab.github.io/CellOracle.documentation/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.html#3.-Integrate-TSS-info-and-cicero-connections
In this dataframe, integrated, the promoter peaks have a co-accessibility score of 1. If the score is less than 1, the peak does not contain a TSS. So you can distinguish promoters from enhancers by looking at the score. I think this is the information you are looking for.
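As a rough sketch of that filtering, assuming the integrated dataframe has `peak_id`, `gene_short_name`, and `coaccess` columns (the column names, the helper `label_peak_type`, and the toy rows below are assumptions; please check them against your own dataframe):

```python
import pandas as pd

# Toy stand-in for the `integrated` dataframe (assumed column names).
# Only the first peak_id comes from this thread; the others are hypothetical.
integrated = pd.DataFrame({
    "peak_id": [
        "chr10_100009210_100010306",
        "chr10_100185817_100186842",
        "chr5_158500000_158501000",
    ],
    "gene_short_name": ["DNMBP", "DNMBP", "EBF1"],
    "coaccess": [1.0, 0.85, 1.0],
})

def label_peak_type(df, score_col="coaccess"):
    """Label peaks with a score of 1 as promoter and all others as distal."""
    out = df.copy()
    is_promoter = out[score_col] >= 1.0
    out["peak_type"] = is_promoter.map({True: "promoter", False: "distal"})
    return out

labeled = label_peak_type(integrated)
print(labeled[["peak_id", "gene_short_name", "peak_type"]])
```

The resulting `peak_type` column can then be joined to the TF info table on `peak_id` to see which motifs fall in promoter versus distal peaks.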
Hi @KenjiKamimoto-wustl122 ,
Thanks for the explanation. It helps but I would like to rephrase my question to make it simpler.
Based on this file, I subset tfinfo for the gene EBF1 (seqname as shown in the picture). I would like to know the relationship between the seqname (in this case the EBF1 gene) and the motifs in the factors_direct column (e.g., ATF2, CREB1). Are these motifs found in the regulatory region of EBF1? How should I interpret this?
I hope this makes it clearer. Thanks!
Hello dev team,
Thanks for the amazing tool. I would like to understand the TFinfo and Base GRN files a little bit better.
The TF info file looks like this:
![Screenshot 2024-03-26 at 6 08 09 PM](https://github.com/morris-lab/CellOracle/assets/83424513/aa8e8f83-15b4-41af-a448-8428bdb25782)
What exactly are the factors_direct and factors_indirect columns? Are these the motifs found in the distal elements from the co-accessibility analysis?
And the GRN looks like this:
![Screenshot 2024-03-26 at 6 08 19 PM](https://github.com/morris-lab/CellOracle/assets/83424513/8d9ba63f-9dc6-439a-8e88-ceee55b0207d)
The gene_short_name is the annotation of the peak_id in the TF info file, right?
Thanks!