twang15 / K562-Analysis

BPNet tutorial-2 #16

Closed twang15 closed 2 years ago

twang15 commented 2 years ago

Continue exploring the two models:

bpnet: https://github.com/kundajelab/bpnet
bpnet tutorial: https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD#scrollTo=k_tFgH1y-dq-
basepairmodels: https://github.com/kundajelab/basepairmodels

bpnet

snakemake -d bpnet/examples -s bpnet/examples/Snakefile chip_nexus -j 4 --quiet

ModuleNotFoundError in line 4 of /oak/stanford/scg/prj_ENCODE/ChIP/BPNet/bpnet/bpnet/examples/Snakefile: No module named 'bpnet'
File "/oak/stanford/scg/prj_ENCODE/ChIP/BPNet/bpnet/bpnet/examples/Snakefile", line 4, in
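A likely fix (not verified here): the Snakefile imports the bpnet package on line 4, so bpnet has to be importable from the active environment, e.g. via an editable install from the cloned repo:

# likely fix (untested): make the bpnet package importable
# run from the root of the cloned bpnet repository
pip install -e .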

basepairmodels

twang15 commented 2 years ago

basepairmodels installation on my laptop

SCG is too slow for installation, so I have to move to my own laptop.

Both Conda and pyenv have system-wide installations, and I have also installed my own pyenv. To avoid conflicts between Conda and pyenv, I disabled my own pyenv installation, but I am not yet sure whether the system installations will conflict.

Here we go!

Reference: https://github.com/kundajelab/basepairmodels

conda create --name basepairmodels python=3.6.8
conda activate basepairmodels
pip install git+https://github.com/kundajelab/basepairmodels.git@v0.1.4

Unfortunately, this fails with an error:

ERROR: Could not find a version that satisfies the requirement tensorflow-gpu==1.14 (from basepairmodels) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0)
ERROR: No matching distribution found for tensorflow-gpu==1.14

And my laptop GPU does not support tensorflow-gpu 1.14 anyway (per the version list in the error, no tensorflow-gpu wheel newer than 1.1.0 was ever published for this platform).

So I moved back to SCG.

twang15 commented 2 years ago

I uninstalled the previous conda installation because it did not work, for unknown reasons.

To reinstall conda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

I expected conda to be set up on the PATH automatically, but it was not, so I added it manually:

export PATH=/oak/stanford/scg/prj_ENCODE/miniconda3/bin:$PATH

conda create --name basepairmodels python=3.6.8
conda activate basepairmodels

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. To initialize your shell, run

$ conda init <SHELL_NAME>

Currently supported shells are:
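Per the error message, initializing the shell once and reloading the rc file fixes this; for bash:

conda init bash
source ~/.bashrc
conda activate basepairmodels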

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipykernel 5.2.0 requires jupyter-client, which is not installed.
ipykernel 5.2.0 requires tornado>=4.2, which is not installed.
awscli 1.18.29 requires PyYAML<5.4,>=3.10; python_version != "3.4", but you have pyyaml 5.4.1 which is incompatible.

Fix the missing and conflicting packages, then retry the install (note the quotes, so the shell does not treat >= as a redirection):

pip install jupyter-client
pip install "tornado>=4.2"
pip uninstall pyyaml
pip install "pyyaml>=3.10,<5.4"
pip install git+https://github.com/kundajelab/basepairmodels.git@v0.1.4
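pip's own checker can confirm whether the conflicts are actually resolved:

pip check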
twang15 commented 2 years ago

Download this file: hg38.chrom.sizes

Reference: https://github.com/kundajelab/basepairmodels

Manually download https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/ and copy it to SCG:
cp GRCh38_EBV.chrom.sizes hg38.chrom.sizes

Note: another way to get a chrom.sizes file from UCSC: if you are referring to the '-g' flag for bedtools, you can simply use the 'fetchChromSizes' script (distributed by UCSC as a 64-bit Linux binary):

fetchChromSizes hg19 > hg19.chrom.sizes
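For this tutorial, the hg38 equivalent would be:

fetchChromSizes hg38 > hg38.chrom.sizes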

wget https://www.encodeproject.org/files/ENCFF198CVB/@@download/ENCFF198CVB.bam -O rep1.bam
wget https://www.encodeproject.org/files/ENCFF488CXC/@@download/ENCFF488CXC.bam -O rep2.bam
wget https://www.encodeproject.org/files/ENCFF023NGN/@@download/ENCFF023NGN.bam -O control.bam
samtools merge -f merged.bam rep1.bam rep2.bam
samtools index merged.bam
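As an optional sanity check before computing coverage, samtools can summarize how many alignments made it into the merged file:

samtools flagstat merged.bam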

bedtools genomecov -5 -bg -strand + -g hg38.chrom.sizes -ibam merged.bam | sort -k1,1 -k2,2n > plus.bedGraph
bedtools genomecov -5 -bg -strand - -g hg38.chrom.sizes -ibam merged.bam | sort -k1,1 -k2,2n > minus.bedGraph
bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw

bedtools genomecov -5 -bg -strand +  -g hg38.chrom.sizes -ibam control.bam  | sort -k1,1 -k2,2n > control_plus.bedGraph
bedtools genomecov -5 -bg -strand -  -g hg38.chrom.sizes -ibam control.bam   | sort -k1,1 -k2,2n > control_minus.bedGraph
bedGraphToBigWig control_plus.bedGraph hg38.chrom.sizes control_plus.bw
bedGraphToBigWig control_minus.bedGraph hg38.chrom.sizes control_minus.bw
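To spot-check the resulting bigWigs, the UCSC bigWigInfo utility (assuming it is installed alongside bedGraphToBigWig) prints basic statistics:

bigWigInfo plus.bw
bigWigInfo control_plus.bw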
twang15 commented 2 years ago
wget https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz
gzip -d ENCFF396BZQ.bed.gz
cp ENCFF396BZQ.bed peaks.bed
tree ENCSR000EGM/data/

ENCSR000EGM/data/
├── control_minus.bw
├── control_plus.bw
├── minus.bw
├── peaks.bed
└── plus.bw

twang15 commented 2 years ago
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gzip -d hg38.fa.gz
module load samtools
samtools faidx hg38.fa
mv hg38.fa hg38.genome.fa
mv hg38.fa.fai hg38.genome.fai

chroms.txt is just a file with a list of all the chromosomes you are interested in.

echo "chr1\nchr2" >> chroms.txt
twang15 commented 2 years ago

Prepare input_data.json

Before we start training, we need to compile a JSON file that contains information about the input data. Here is a sample that shows how to specify this for the data we organized above. The data is organized into tracks; in this example we have two tracks, the plus and the minus strand. Each track has four required keys (strand, task_id, signal, and peaks) and one optional key, control, which can be specified if control data is available. Notice how the task_id remains the same for the two tracks: we use the same task_id for the plus & minus pair of the same experiment, and use strand to distinguish between the two, 0 for the plus strand and 1 for the minus strand.

  {
      "task0_plus": {
          "strand": 0,
          "task_id": 0,
          "signal": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/plus.bw",
          "control": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/control_plus.bw",
          "peaks": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/peaks.bed"},

      "task0_minus": {
          "strand": 1,
          "task_id": 0,
          "signal": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/minus.bw",
          "control": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/control_minus.bw",
          "peaks": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/peaks.bed"}
  }

Prepare splits.json

The splits.json file contains information about the chromosomes used for validation and testing. Here is a sample that contains one split.

  {
      "0": {
          "val": ["chr10", "chr8"], 
          "test": ["chr1"]
      }
  }
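For cross-validation you would add further numbered splits; written as a heredoc (a sketch: the second split here is an assumption, based on the zero-indexed key above and the per-split model file names used later, e.g. model_split000.h5):

cat > splits.json <<'EOF'
{
    "0": {"val": ["chr10", "chr8"], "test": ["chr1"]},
    "1": {"val": ["chr1", "chr8"], "test": ["chr10"]}
}
EOF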

Train the model

The command to train a model is called train.

You can kick off the training process by executing this command in your shell:

BASE_DIR=~/ENCSR000EGM
DATA_DIR=$BASE_DIR/data
MODEL_DIR=$BASE_DIR/models
REFERENCE_DIR=~/reference
CHROM_SIZES=$REFERENCE_DIR/hg38.chrom.sizes
REFERENCE_GENOME=$REFERENCE_DIR/hg38.genome.fa
CV_SPLITS=$BASE_DIR/splits.json
INPUT_DATA=$BASE_DIR/input_data.json

mkdir $MODEL_DIR
train \
    --input-data $INPUT_DATA \
    --stranded \
    --has-control \
    --output-dir $MODEL_DIR \
    --reference-genome $REFERENCE_GENOME \
    --chroms $(paste -s -d ' ' $REFERENCE_DIR/hg38_chroms.txt) \
    --chrom-sizes $CHROM_SIZES \
    --splits $CV_SPLITS \
    --model-arch-name BPNet1000d8 \
    --sequence-generator-name BPNet \
    --model-output-filename model \
    --input-seq-len 2114 \
    --output-len 1000 \
    --filters 64 \
    --shuffle \
    --threads 10 \
    --epochs 100 \
    --learning-rate 0.004

This training command gets instantiated as:

train --input-data ENCSR000EGM/input_data.json --stranded --has-control --output-dir ENCSR000EGM/models --reference-genome reference/hg38.genome.fa --chroms chr1 chr2 --chrom-sizes reference/hg38.chrom.sizes --splits ENCSR000EGM/splits.json --model-arch-name BPNet1000d8 --sequence-generator-name BPNet --model-output-filename model --input-seq-len 2114 --output-len 1000 --filters 64 --shuffle --threads 10 --epochs 100 --learning-rate 0.004

twang15 commented 2 years ago

Enable conda and activate the basepairmodels environment

source ~/conda_setup
conda env list

# conda environments:
#
base                  *  /oak/stanford/scg/prj_ENCODE/miniconda3
basepairmodels           /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels

conda activate basepairmodels

install pudb for debugging

pip install pudb

use pudb to inspect the source code

pudb /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/train --input-data ENCSR000EGM/input_data.json --stranded --has-control --output-dir ENCSR000EGM/models --reference-genome reference/hg38.genome.fa --chroms chr1 chr2 --chrom-sizes reference/hg38.chrom.sizes --splits ENCSR000EGM/splits.json --model-arch-name BPNet1000d8 --sequence-generator-name BPNet --model-output-filename model --input-seq-len 2114 --output-len 1000 --filters 64 --shuffle --threads 10 --epochs 100 --learning-rate 0.004

twang15 commented 2 years ago

multiprocessing — Manage Processes Like Threads

  1. Multiprocessing: https://pymotw.com/3/multiprocessing/index.html
  2. Multi-threading: https://pymotw.com/3/threading/index.html
  3. fork, spawn, forkserver: https://stackoverflow.com/questions/64095876/multiprocessing-fork-vs-spawn

Python multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'

I encountered this error message and fixed it by adding one line:

either line 7 (__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)") or line 8 (__spec__ = None) in /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/train

Reference: https://stackoverflow.com/questions/45720153/python-multiprocessing-error-attributeerror-module-main-has-no-attribute

  I was able to dump the contents of __spec__ and assign them to a variable inside main, which allows this code to function within the IPython console.

  from multiprocessing import Pool

  def f(x):
      return x * x

  if __name__ == '__main__':
      # assigning __spec__ works around the missing attribute in interactive sessions
      __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
      with Pool(5) as p:
          print(p.map(f, [1, 2, 3]))
twang15 commented 2 years ago

Predict

pudb /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/predict --model ENCSR000EGM/models/model_split000.h5 --chrom-sizes reference/hg38.chrom.sizes --chroms chr1 --reference-genome reference/hg38.genome.fa --exponentiate-counts --output-dir ENCSR000EGM/predictions --input-data ENCSR000EGM/input_data.json --predict-peaks --write-buffer-size 2000 --batch-size 1 --has-control --stranded --input-seq-len 2114 --output-len 1000 --output-window-size 1000
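Judging from the metrics commands further down, the predictions directory should end up with per-task profile and exponentiated-counts bigWigs:

ls ENCSR000EGM/predictions/
# expected, based on the metrics commands below:
# model_split000_task0_plus.bw
# model_split000_task0_plus_exponentiated_counts.bw
# model_split000_task0_minus.bw
# model_split000_task0_minus_exponentiated_counts.bw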

twang15 commented 2 years ago

How to debug a multiprocessing program with pudb

  1. Use puDB to Debug Python Multiprocessing Code | by Auro Tripathy | Medium.pdf

    • add "from pudb.remote import set_trace" to the program main module (entry model)
    • add "set_trace()" to where you want to breakpoint. Without parameters, this will maximize remote pudb window (telnet window)
    • You can easily do that with some conditional logic show below to selectively break into a particular process

      if worker_id == 1: # debug process with id 1 set_trace()

    • Once the condition is met in the process you want to break in, puDB will look for a free port and wait for a telnet connection:

      pudb:6899: Please telnet into 127.0.0.1 6899. pudb:6899: Waiting for client...

    • Go to another terminal, type, telnet 127.0.0.1 6899 and expect to see the console taken over by the visual puDB debugger.

Using the debugger after forking

  from multiprocessing import Process
  from pudb.remote import set_trace  # required for set_trace() below

  def f(name):
      # breakpoint() was introduced in Python 3.7
      # breakpoint()
      set_trace()  # more general
      print('hello', name)

  p = Process(target=f, args=('bob',))
  p.start()
  p.join()

Method 1. On terminal A: PYTHONBREAKPOINT=pudb.forked.set_trace python script.py

  pudb:6899: Please telnet into 127.0.0.1 6899.
  pudb:6899: Waiting for client...

On terminal B: telnet 127.0.0.1 6899

to enter the remote pudb.

Method 2. On terminal A: pudb script.py

Method 1 is more general; it is the same as the approach above, and can selectively break into one of multiple forked processes.

twang15 commented 2 years ago

To avoid the GPU memory problem, run prediction from an interactive job:

srun --nodes=1 --ntasks=1 --cpus-per-task=10 --partition=interactive --account=default --time=4:00:00 --pty /bin/bash

source ~/conda_setup
conda activate basepairmodels
cd /home/taowang9/ChIP/basepairmodel/basepairmodels/tutorial
time /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/predict --model ENCSR000EGM/models/model_split000.h5 --chrom-sizes reference/hg38.chrom.sizes --chroms chr1 --reference-genome reference/hg38.genome.fa --exponentiate-counts --output-dir ENCSR000EGM/predictions --input-data ENCSR000EGM/input_data.json --predict-peaks --write-buffer-size 2000 --batch-size 1 --has-control --stranded --input-seq-len 2114 --output-len 1000 --output-window-size 1000

twang15 commented 2 years ago

Compute metrics

To compute metrics, first compute the min/max bounds for each peak individually, on both the plus and minus strands. This can be done with a single command as follows:

METRICS_DIR=$BASE_DIR/metrics
mkdir $METRICS_DIR
BOUNDS_DIR=$BASE_DIR/bounds
mkdir $BOUNDS_DIR
bounds \
    --input-profiles $DATA_DIR/plus.bw $DATA_DIR/minus.bw \
    --output-names task0_plus task0_minus \
    --output-directory $BOUNDS_DIR \
    --peaks $DATA_DIR/peaks.bed \
    --chroms chr1

Instantiated as:

bounds --input-profiles ENCSR000EGM/data/plus.bw ENCSR000EGM/data/minus.bw --output-names task0_plus task0_minus --output-directory ENCSR000EGM/bounds --peaks ENCSR000EGM/data/peaks.bed --chroms chr1

Now compute metrics on the plus and minus strands separately using the following command:

metrics \
   -A [path to profile training bigwig] \
   -B [path to profile predictions bigwig] \
   --peaks $DATA_DIR/peaks.bed \
   --chroms chr1 \
   --output-dir $METRICS_DIR \
   --apply-softmax-to-profileB \
   --countsB [path to exponentiated counts predictions bigWig] \
   --chrom-sizes $CHROM_SIZES

Instantiated for the plus and the minus strand separately:

metrics -A ENCSR000EGM/data/plus.bw -B ENCSR000EGM/predictions/model_split000_task0_plus.bw --peaks ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/metrics --apply-softmax-to-profileB --countsB ENCSR000EGM/predictions/model_split000_task0_plus_exponentiated_counts.bw --chrom-sizes reference/hg38.chrom.sizes

metrics -A ENCSR000EGM/data/minus.bw -B ENCSR000EGM/predictions/model_split000_task0_minus.bw --peaks ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/metrics --apply-softmax-to-profileB --countsB ENCSR000EGM/predictions/model_split000_task0_minus_exponentiated_counts.bw --chrom-sizes reference/hg38.chrom.sizes

twang15 commented 2 years ago

Compute importance scores

SHAP_DIR=$BASE_DIR/shap
mkdir $SHAP_DIR
shap_scores \
    --reference-genome $REFERENCE_GENOME \
    --model $(ls ${MODEL_DIR}/INSERT-DIRECTORY-NAME-HERE/*.h5) \
    --bed-file $DATA_DIR/peaks.bed \
    --chroms chr1 \
    --output-dir $SHAP_DIR \
    --input-seq-len 2114 \
    --control-len 1000 \
    --control-smoothing 7.0 81 \
    --task-id 0 \
    --control-info $INPUT_DATA

Instantiated as:

shap_scores --reference-genome reference/hg38.genome.fa --model ENCSR000EGM/models/model_split000.h5 --bed-file ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/shap --input-seq-len 2114 --control-len 1000 --control-smoothing 7.0 81 --task-id 0 --control-info ENCSR000EGM/input_data.json
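Per the motif_discovery commands below, shap_scores should leave two HDF5 files in the output directory, one with profile attributions and one with counts attributions:

ls ENCSR000EGM/shap/
# expected, based on the motif_discovery commands below:
# profile_scores.h5
# counts_scores.h5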

Discover motifs with TF-MoDISco

MODISCO_PROFILE_DIR=$BASE_DIR/modisco_profile
mkdir $MODISCO_PROFILE_DIR
motif_discovery \
    --scores-path $SHAP_DIR/profile_scores.h5 \
    --output-directory $MODISCO_PROFILE_DIR

MODISCO_COUNTS_DIR=$BASE_DIR/modisco_counts
mkdir $MODISCO_COUNTS_DIR
motif_discovery \
    --scores-path $SHAP_DIR/counts_scores.h5 \
    --output-directory $MODISCO_COUNTS_DIR

Instantiated as:

motif_discovery --scores-path ENCSR000EGM/shap/profile_scores.h5 --output-directory ENCSR000EGM/modisco_profile
motif_discovery --scores-path ENCSR000EGM/shap/counts_scores.h5 --output-directory ENCSR000EGM/modisco_counts