Integration of bulk ATAC-seq data and scRNAa-seq data: what is the "w" in the "PositionID"?

kamisama0101 commented 5 months ago

Checks before submitting the issue

[√] I installed the latest Dictys version successfully
[√] I downloaded the whole folder for the tutorial
[√] I ran the tutorial cells and notebooks (if more than one) in the correct order and only once
[√] The preceeding cells and notebooks finished successfully (i.e. the reported error is the first encountered)

Describe the error

How did you install Dictys (which option)?
Which tutorial?
What and where was the error?
Did you deviate from any of the tutorial instructions?

Optional steps (may accelerate troubleshooting)

At first, I follow the issue #23 (https://github.com/pinellolab/dictys/issues/23) to integrate bulk ATAC-seq data and scRNAa-seq:

Follow any tutorial (e.g. https://github.com/pinellolab/dictys/blob/master/doc/tutorials/short-multiome/notebooks/main.ipynb) and adapt it to your dataset and machine. Do not put in your ATAC-seq data yet.
Expect failures in "Validate input data" step because of that and stop before "Network inference" step.
Create folders tmp_static/$celltype for each $celltype in data/subsets.txt.
Copy your ATAC-seq peak _peaks.narrowPeak bed file to each tmp_static/$celltype/footprint.bed. They can be different for each folder depending on the cell-type specificity of your ATAC-seq data. But it should only contain the strongest (here 100K) peaks based on column. This was our peak based (not footprint based) TF binding network approach mentioned in our internal benchmark section, although the peaks were called for each cell type from scATAC-seq data.
In each tmp_static/$celltype folder, run touch names_atac0.txt; touch names_atac.txt; touch reads.bai reads.bam peaks.bed; touch footprints.bed to trick make with modification dates.
Continue running the notebook from the "Network inference" step. However, I found the empty footprints.bed will leave the process sleep forever because the files in the 15-tfs cannot be generated: (in chromatin_homer.sh) ls -1 14-reform-split | tail -n +2 | while read l; do while ! [ -e 15-tfs/"$l".done ]; do sleep 1 done tail -n +2 15-tfs/"$l" >> 15-tf.bed rm -f 15-tfs/"$l" done

Besides, I didn't identify the footprint.bed file in the corresponding location (tmp_static/Subset*) of the completed short-multiome pipelines. Therefore, I hypothesised that the footprints.bed should not be an empty file and the footprint.bed in step 4 of issue #23 should be footprints.bed, so I modified the codes and copy the footprints.bed from footprint.bed and redid step 5 and 6 of issue #23. However, IndexError: list index out of range appeared in d['chr']=d['PositionID'].apply(lambda x:x.split(':')[1]) in chromatin_homer.py. chromatin_homer.py. I thought this PosistionID was derived from column 4 of the footprints.bed, which is supposed to be the 'peak id','chr','start','end','name','w' seperated by ":", so I tried to reformat column 4. However, I cannot find what is the "w". I guess the "w" is Wellington score generated from Wellington bootstrap or something, but I don't know how to calculate "w" from the bulk ATAC-seq data both biologically and mathematically. Finally, I set all "w" in my footprints.bed as "-1000.0" and reperform the step 5 and 6 in issue #23 . In summary, I have two questions:

Is the footprint.bed in step 4 in issue #23 supposed to be footrprints.bed?
Will the "w" affect the downstream network inference? If so, how can I obtain the "w" from bulk ATAC-seq data

lingfeiwang commented 5 months ago

Hi kamisama0101. Thank you for the finding the error. You are right that it should be footprints.bed. Could you try again by saving your ATAC-seq peak _peaks.narrowPeak file as footprints.bed, but without modifying the code? The code can be interlinked in several places so incomplete changes may lead to errors too.

Lingfei

kamisama0101 commented 5 months ago

I can obtain the files before running the network inference in autoDL platform by cloning the instance. By saving my ATAC-seq peak _peaks.narrowPeak file as footprints.bed, I reperform the network inference. The same Error occured that IndexError: list index out of range appeared in d['chr']=d['PositionID'].apply(lambda x:x.split(':')[1]). It seems that I still need "w" values in my footprints.bed.

lingfeiwang commented 5 months ago

Issue resolved outside github.

pinellolab / dictys

Integration of bulk ATAC-seq data and scRNAa-seq data: what is the "w" in the "PositionID"? #62