pinellolab / dictys

Context specific and dynamic gene regulatory network reconstruction and analysis
GNU Affero General Public License v3.0
110 stars 14 forks source link

Integration of bulk ATAC-seq data and scRNAa-seq data: what is the "w" in the "PositionID"? #62

Closed kamisama0101 closed 5 months ago

kamisama0101 commented 5 months ago

Checks before submitting the issue

Describe the error

Optional steps (may accelerate troubleshooting)

At first, I follow the issue #23 (https://github.com/pinellolab/dictys/issues/23) to integrate bulk ATAC-seq data and scRNAa-seq:

  1. Follow any tutorial (e.g. https://github.com/pinellolab/dictys/blob/master/doc/tutorials/short-multiome/notebooks/main.ipynb) and adapt it to your dataset and machine. Do not put in your ATAC-seq data yet.
  2. Expect failures in "Validate input data" step because of that and stop before "Network inference" step.
  3. Create folders tmp_static/$celltype for each $celltype in data/subsets.txt.
  4. Copy your ATAC-seq peak _peaks.narrowPeak bed file to each tmp_static/$celltype/footprint.bed. They can be different for each folder depending on the cell-type specificity of your ATAC-seq data. But it should only contain the strongest (here 100K) peaks based on column. This was our peak based (not footprint based) TF binding network approach mentioned in our internal benchmark section, although the peaks were called for each cell type from scATAC-seq data.
  5. In each tmp_static/$celltype folder, run touch names_atac0.txt; touch names_atac.txt; touch reads.bai reads.bam peaks.bed; touch footprints.bed to trick make with modification dates.
  6. Continue running the notebook from the "Network inference" step. However, I found the empty footprints.bed will leave the process sleep forever because the files in the 15-tfs cannot be generated: (in chromatin_homer.sh) ls -1 14-reform-split | tail -n +2 | while read l; do while ! [ -e 15-tfs/"$l".done ]; do sleep 1 done tail -n +2 15-tfs/"$l" >> 15-tf.bed rm -f 15-tfs/"$l" done

Besides, I didn't identify the footprint.bed file in the corresponding location (tmp_static/Subset*) of the completed short-multiome pipelines. Therefore, I hypothesised that the footprints.bed should not be an empty file and the footprint.bed in step 4 of issue #23 should be footprints.bed, so I modified the codes and copy the footprints.bed from footprint.bed and redid step 5 and 6 of issue #23. However, IndexError: list index out of range appeared in d['chr']=d['PositionID'].apply(lambda x:x.split(':')[1]) in chromatin_homer.py. chromatin_homer.py. I thought this PosistionID was derived from column 4 of the footprints.bed, which is supposed to be the 'peak id','chr','start','end','name','w' seperated by ":", so I tried to reformat column 4. However, I cannot find what is the "w". I guess the "w" is Wellington score generated from Wellington bootstrap or something, but I don't know how to calculate "w" from the bulk ATAC-seq data both biologically and mathematically. Finally, I set all "w" in my footprints.bed as "-1000.0" and reperform the step 5 and 6 in issue #23 . In summary, I have two questions:

  1. Is the footprint.bed in step 4 in issue #23 supposed to be footrprints.bed?
  2. Will the "w" affect the downstream network inference? If so, how can I obtain the "w" from bulk ATAC-seq data
lingfeiwang commented 5 months ago

Hi kamisama0101. Thank you for the finding the error. You are right that it should be footprints.bed. Could you try again by saving your ATAC-seq peak _peaks.narrowPeak file as footprints.bed, but without modifying the code? The code can be interlinked in several places so incomplete changes may lead to errors too.

Lingfei

kamisama0101 commented 5 months ago

I can obtain the files before running the network inference in autoDL platform by cloning the instance. By saving my ATAC-seq peak _peaks.narrowPeak file as footprints.bed, I reperform the network inference. The same Error occured that IndexError: list index out of range appeared in d['chr']=d['PositionID'].apply(lambda x:x.split(':')[1]). It seems that I still need "w" values in my footprints.bed.

lingfeiwang commented 5 months ago

Issue resolved outside github.