pinellolab / dictys

Context specific and dynamic gene regulatory network reconstruction and analysis
GNU Affero General Public License v3.0
111 stars 14 forks source link

Data input type #23

Closed orbitalse closed 1 year ago

orbitalse commented 1 year ago

Hello,

I am working with a dataset which has ATAC-seq in bigWig format; raw ATAC-seq data was mapped with bowtie2 and peaks were called with MACS2.

I see from the Dictys tutorial that Dictys takes ATAC-seq data in .bam format. I was wondering whether there is a work-around such that Dictys can take bigWig data for ATAC-seq as input? I would prefer not to have to remap all the raw data.

Many thanks.

lingfeiwang commented 1 year ago

Hi orbitalse,

Thank you for the question! If you did not keep the raw output from bowtie2, there is indeed a workaround for cell-type specific GRN inference based on your current pipeline.

  1. Follow any tutorial (e.g. https://github.com/pinellolab/dictys/blob/master/doc/tutorials/short-multiome/notebooks/main.ipynb) and adapt it to your dataset and machine. Do not put in your ATAC-seq data yet.
  2. Expect failures in "Validate input data" step because of that and stop before "Network inference" step.
  3. Create folders tmp_static/$celltype for each $celltype in data/subsets.txt.
  4. Copy your ATAC-seq peak _peaks.narrowPeak bed file to each tmp_static/$celltype/footprints.bed. They can be different for each folder depending on the cell-type specificity of your ATAC-seq data. But it should only contain the strongest (here 100K) peaks based on column 5. This was our peak based (not footprint based) TF binding network approach mentioned in our internal benchmark section, although the peaks were called for each cell type from scATAC-seq data.
  5. In each tmp_static/$celltype folder, run touch names_atac0.txt; touch names_atac.txt; touch reads.bai reads.bam peaks.bed; touch footprints.bed to trick make with modification dates.
  6. Continue running the notebook from the "Network inference" step.

Note the workaround differs from the default approach in Dictys, using peaks instead of footprints to construct TF binding networks and using ATAC-seq data potentially not at the cell type level. I haven't personally tested this workaround so things may fail. In that case, please post any question or error here.

Let me know if it works or if you need a workaround for dynamic GRN inference instead.

Lingfei

orbitalse commented 1 year ago

Hi Lingfei,

Thanks so much for your timely and very informative reply!

I apologize if I misspoke in my original post or introduced confusion by mentioning MACS2. I actually have ATAC-seq data in bigWig format, so from my understanding I think the TF footprint approach would still be viable in this case. I was wondering whether there would be a similar workaround in the Dictys pipeline to use data in bigWig format as input rather than in bam format?

Many thanks.

lingfeiwang commented 1 year ago

Hi orbitalse,

Thanks for the clarifications.

Do you have one bigWig file aggregated across all cells? If so, unfortunately I do not know how to infer cell-type specific TF binding networks from it. It does not contain the cell barcode or cell type of each read, right? Therefore, you cannot extract cell-type specific ATAC-seq reads or infer cell-type specific TF binding networks from the bigWig file.

It also appears challenging if you intend to use TF footprinting but not cell-type specific TF binding networks. We use pyDNase for TF footprinting but it only accepts bam files. If you intend to use a different footprinting software that accepts bigWig files, I am very happy to help.

Otherwise, the above suggestion remains the best workaround to me.

Please let me know how I can help.

Lingfei

orbitalse commented 1 year ago

Hi Lingfei,

Many thanks for your last reply. I apologize for the delay, as I've just had a chance to revisit this issue.

I think I'll start with your original suggestion regarding using ATAC-seq peaks (rather than footprints) to build the GRNs. The ATAC-seq peaks I have are cell type-specific, so they should be able to infer cell type-specific TF binding networks. I'll follow your steps above to adapt the Dictys pipeline for the ATAC-seq peak data and let you know if I have any questions.

In the meantime, the data I am using is from cell sorting (i.e., not single-cell). I have both RNA-seq and ATAC-seq data from a large number (>50) of cell populations. For the RNA-seq data, I have a gene expression matrix where each row is a gene and each column corresponds to one cell type. For the ATAC-seq data, I have the ATAC-seq peaks, one BED file per cell type. I was wondering if there are any other workarounds you would anticipate that I might need to run these data through Dictys?

Thanks again for your help.

lingfeiwang commented 1 year ago

Hi orbitalse,

Your data type turns out quite interesting and different from the common types. Considering that most GRN inference would need a hundred or more samples, you may still infer cell-type specific GRNs but under a restrictive assumption. That is, each gene regulation has a constant strength across cell types but its activeness or activity is determined by cell-type specific TF binding events. In Dictys, we do not assume constant strength gene regulations because Dictys can utilize the statistical power in single-cell datasets to fixate the extra degrees of freedom in the model.

Therefore, although I very appreciate your interest in Dictys, I don't think it provides the most suitable software or model for your dataset. You may want to use a model as in or similar to Pando or TRIPOD, which is $X_{ik}=\sumja{jik}\beta{ji}X{jk}+\varepsilon{ik}$, where $X{ik}$ is gene $i$'s expression in cell type $k$, $a{ji}$ is the cell-type specific activeness (0,1) or activity (0 to 1) of TF $j$ binding to the proximal region of gene $i$ in cell type $k$, $\beta{ji}$ is the regulation strength from TF $j$ to gene $i$ if binding is active, and $\varepsilon_{ik}$ is unexplained noise term. However, I would not suggest to use these softwares directly either because they are designed for single-cell datasets.

As the discussion is getting more and more technical, please send me an email if you want to discuss further.

Lingfei