morris-lab / CellOracle

This is the alpha version of the CellOracle package

tfinfo and GRN files #191

Open momur opened 3 months ago

momur commented 3 months ago

Hello dev team,

Thanks for the amazing tool. I would like to understand the TFinfo and Base GRN files a little bit better.

The TF info file looks like this: [screenshot of the TF info table]

What exactly are the factors_direct and factors_indirect columns? Are these the motifs found in the distal elements from the co-accessibility analysis?

And the GRN looks like this: [screenshot of the base GRN table]

The gene_short_name is the annotation of the peak_id in the TF info file, right?

Thanks!

KenjiKamimoto-wustl122 commented 3 months ago

Hi @momur ,

Thank you for trying celloracle. In short, factors_direct and factors_indirect differ in their information source. factors_direct is based on experimentally confirmed motifs, while factors_indirect is based on more indirect evidence or computational inference. For more information, please see the explanations in the motif database documentation: http://cisbp.ccbr.utoronto.ca/faq.html and https://gimmemotifs.readthedocs.io/en/master/index.html
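As a rough illustration of how the two columns could be read together, the sketch below tags each TF name in a row with its evidence type. It assumes each cell holds a comma-separated string of TF names, which is an assumption about the table layout rather than something confirmed in this thread. The row values and the helpers `split_factors` and `factors_with_evidence` are hypothetical and not part of celloracle.

```python
def split_factors(cell):
    """Split a comma-separated cell such as "ATF2, CREB1" into a list of names."""
    if not cell:
        return []
    return [name.strip() for name in cell.split(",") if name.strip()]


def factors_with_evidence(row):
    """Map each TF name in a row to "direct" or "indirect".

    If a name appears in both columns, "direct" takes precedence.
    """
    evidence = {name: "indirect" for name in split_factors(row.get("factors_indirect"))}
    evidence.update({name: "direct" for name in split_factors(row.get("factors_direct"))})
    return evidence


# Hypothetical row; the TF names are illustrative only.
row = {
    "peak_id": "chr10_100009210_100010306",
    "factors_direct": "ATF2, CREB1",
    "factors_indirect": "CREB1, ATF7",
}
print(factors_with_evidence(row))
```

Letting "direct" win when a name appears in both columns is a design choice for this sketch; depending on the analysis, keeping both labels may be more appropriate.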

The binding site is shown in the seqname or peak_id column. Some of the elements are distal, and some are proximal.

As you pointed out, the gene_short_name is an annotation of the peak_id. For example, peak "chr10_100009210_100010306" is a cis-regulatory element of the gene DNMBP.
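Peak IDs such as the one above encode a genomic interval as chromosome, start, and end joined by underscores. A minimal sketch for splitting one apart is shown here; the helper name `parse_peak_id` is hypothetical and not part of celloracle.

```python
def parse_peak_id(peak_id):
    """Split a peak ID of the form <chrom>_<start>_<end> into its parts.

    rsplit is used so that chromosome names containing underscores
    are kept intact.
    """
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)


chrom, start, end = parse_peak_id("chr10_100009210_100010306")
print(chrom, start, end, "length:", end - start)
```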

momur commented 3 months ago

Hi,

Thank you for your reply. It helps.

My motivation here is to know which motifs are found in the promoter and enhancer regions. We provide co-accessible peaks, and celloracle performs TF motif scanning in the co-accessible sites. TF info is created after the TF motif scanning step. I thought that direct factors are the motifs found in the co-accessible sites. Is there a way to retrieve the information that I am looking for from any celloracle outputs?

Thanks!

KenjiKamimoto-wustl122 commented 3 months ago

@momur

You can distinguish promoter peaks and other distal regulatory element peaks as follows.

In the peak data preprocessing step, peak annotation was already done. https://morris-lab.github.io/CellOracle.documentation/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.html#3.-Integrate-TSS-info-and-cicero-connections

[screenshot of the integrated dataframe]

In this dataframe, integrated, the promoter peaks have a co-accessibility score of 1. If the co-accessibility score is less than 1, the peak does not contain a TSS. So, you can distinguish promoters from enhancers by looking at the score. I think this is the information you are looking for.
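The rule described above can be sketched in plain Python as follows. The rows stand in for the integrated dataframe; the column names (`peak_id`, `gene_short_name`, `coaccess`), the second peak, and all score values are assumptions made for illustration, and `peak_type` is a hypothetical helper.

```python
# Hypothetical rows standing in for the `integrated` dataframe.
# Column names and values are assumptions, not real output.
integrated_rows = [
    {"peak_id": "chr10_100009210_100010306", "gene_short_name": "DNMBP", "coaccess": 1.0},
    {"peak_id": "chr10_100185000_100186000", "gene_short_name": "DNMBP", "coaccess": 0.85},
]


def peak_type(coaccess):
    """Label a peak as a promoter (score of 1) or a distal element (score below 1)."""
    return "promoter" if coaccess >= 1.0 else "distal"


# Collect peak IDs per gene, separated by peak type.
peaks_by_gene = {}
for row in integrated_rows:
    gene = peaks_by_gene.setdefault(row["gene_short_name"], {"promoter": [], "distal": []})
    gene[peak_type(row["coaccess"])].append(row["peak_id"])

print(peaks_by_gene)
```

With a pandas dataframe, the same split would presumably be a boolean filter on the score column. The resulting peak IDs could then be matched against the peak_id column of the TF info table to see which motifs fall in promoter versus distal peaks.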

momur commented 2 months ago

Hi @KenjiKamimoto-wustl122 ,

Thanks for the explanation. It helps, but I would like to rephrase my question to make it simpler.

Based on this file, I subset tfinfo for the gene called EBF1 (seqname as shown in the picture). I would like to know the relationship between the seqname (in this case, the EBF1 gene) and the motifs in the factors_direct column (e.g., ATF2, CREB1). Are these motifs found in the regulatory region of EBF1? How should I interpret it? [screenshot of tfinfo subset for EBF1]

I hope that makes it clear now. Thanks!