Closed twang15 closed 2 years ago
SCG is too slow for installation, so I have to move to my own laptop
Conda and pyenv both have system installations, although I have installed my own pyenv. To avoid conflicts between Conda and pyenv, I disabled my own pyenv installation. But I am not sure whether the system installation will conflict yet.
Here we go!
Reference: https://github.com/kundajelab/basepairmodels
conda create --name basepairmodels python=3.6.8
conda activate basepairmodels
pip install git+https://github.com/kundajelab/basepairmodels.git@v0.1.4
Unfortunately, there will be an error complaining:
ERROR: Could not find a version that satisfies the requirement tensorflow-gpu==1.14 (from basepairmodels) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0)
ERROR: No matching distribution found for tensorflow-gpu==1.14
And my laptop GPU does not support tensorflow-gpu 1.14.
So I moved back to SCG.
I have uninstalled the previous conda installation because it did not work, for reasons unknown.
To reinstall conda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
I expected conda to be set up properly, but it was not. So I manually added conda to the PATH:
export PATH=/oak/stanford/scg/prj_ENCODE/miniconda3/bin:$PATH
conda create --name basepairmodels python=3.6.8
conda activate basepairmodels
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
powershell
source ~/conda_setup # cannot remember how this file was generated before; maybe produced by the Anaconda (not miniconda) installation
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipykernel 5.2.0 requires jupyter-client, which is not installed.
ipykernel 5.2.0 requires tornado>=4.2, which is not installed.
awscli 1.18.29 requires PyYAML<5.4,>=3.10; python_version != "3.4", but you have pyyaml 5.4.1 which is incompatible.
pip install jupyter-client
pip install "tornado>=4.2"  # quote the specifier so the shell does not treat >= as a redirection
pip uninstall pyyaml
pip install "pyyaml>=3.10,<5.4"
pip install git+https://github.com/kundajelab/basepairmodels.git@v0.1.4
Reference: https://github.com/kundajelab/basepairmodels
Manually download https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/ and copy it to SCG.
cp GRCh38_EBV.chrom.sizes hg38.chrom.sizes
Note: another way to get a chrom.sizes file from UCSC: if you are referring to the '-g' flag for bedtools, you can simply use the 'fetchChromSizes' script (the link is to a 64-bit Linux binary), e.g.:
fetchChromSizes hg19 > hg19.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF198CVB/@@download/ENCFF198CVB.bam -O rep1.bam
wget https://www.encodeproject.org/files/ENCFF488CXC/@@download/ENCFF488CXC.bam -O rep2.bam
wget https://www.encodeproject.org/files/ENCFF023NGN/@@download/ENCFF023NGN.bam -O control.bam
samtools merge -f merged.bam rep1.bam rep2.bam
samtools index merged.bam
bedtools genomecov -5 -bg -strand + -g hg38.chrom.sizes -ibam merged.bam | sort -k1,1 -k2,2n > plus.bedGraph
bedtools genomecov -5 -bg -strand - -g hg38.chrom.sizes -ibam merged.bam | sort -k1,1 -k2,2n > minus.bedGraph
bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw
bedtools genomecov -5 -bg -strand + -g hg38.chrom.sizes -ibam control.bam | sort -k1,1 -k2,2n > control_plus.bedGraph
bedtools genomecov -5 -bg -strand - -g hg38.chrom.sizes -ibam control.bam | sort -k1,1 -k2,2n > control_minus.bedGraph
bedGraphToBigWig control_plus.bedGraph hg38.chrom.sizes control_plus.bw
bedGraphToBigWig control_minus.bedGraph hg38.chrom.sizes control_minus.bw
wget https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz
gzip -d ENCFF396BZQ.bed.gz
cp ENCFF396BZQ.bed peaks.bed
tree ENCSR000EGM/data/
ENCSR000EGM/data/
├── control_minus.bw
├── control_plus.bw
├── minus.bw
├── peaks.bed
└── plus.bw
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gzip -d hg38.fa.gz
module load samtools
samtools faidx hg38.fa
mv hg38.fa hg38.genome.fa
mv hg38.fa.fai hg38.genome.fai
echo "chr1\nchr2" >> chroms.txt
Before we start training, we need to compile a JSON file that contains information about the input data. Here is a sample JSON file that shows how to specify the input data information for the data we organized in Section 1.3. The data is organized into tracks; in this example we have two tracks, the plus and the minus strand. Each track has four required keys (strand, task_id, signal, and peaks) and one optional key (control), which can be specified if control data is available. Notice how the task_id remains the same for the two tracks: we use the same task_id for the plus and minus pair of the same experiment, and use strand to distinguish between the two, 0 for the plus strand and 1 for the minus strand.
{
"task0_plus": {
"strand": 0,
"task_id": 0,
"signal": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/plus.bw",
"control": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/control_plus.bw",
"peaks": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/peaks.bed"},
"task0_minus": {
"strand": 1,
"task_id": 0,
"signal": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/minus.bw",
"control": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/control_minus.bw",
"peaks": "/oak/stanford/scg/prj_ENCODE/ChIP/basepairmodel/basepairmodels/tutorial/ENCSR000EGM/data/peaks.bed"}
}
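As a quick sanity check (an illustrative snippet of my own, not part of basepairmodels), you can verify that every track in input_data.json carries the four required keys:

```python
import json

REQUIRED = {"strand", "task_id", "signal", "peaks"}  # "control" is optional

def check_input_data(config):
    """Raise ValueError if any track is missing a required key; return track names."""
    for name, track in config.items():
        missing = REQUIRED - track.keys()
        if missing:
            raise ValueError(f"track {name!r} missing keys: {sorted(missing)}")
    return list(config)
```

Usage: check_input_data(json.load(open("input_data.json"))).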
The splits.json file contains information about the chromosomes used for validation and test. Here is a sample that contains one split.
{
"0": {
"val": ["chr10", "chr8"],
"test": ["chr1"]
}
}
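Chromosomes listed in neither val nor test remain available for training. A small helper of my own (not from basepairmodels) that checks one split entry and derives the training set:

```python
def training_chroms(split, all_chroms):
    """Return the chromosomes of all_chroms not held out for val/test in one split."""
    val, test = set(split["val"]), set(split["test"])
    assert not (val & test), "val and test chromosomes must not overlap"
    return [c for c in all_chroms if c not in val | test]
```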
The command to train a model is called train.
You can kick-start the training process by executing this command in your shell:
BASE_DIR=~/ENCSR000EGM
DATA_DIR=$BASE_DIR/data
MODEL_DIR=$BASE_DIR/models
REFERENCE_DIR=~/reference
CHROM_SIZES=$REFERENCE_DIR/hg38.chrom.sizes
REFERENCE_GENOME=$REFERENCE_DIR/hg38.genome.fa
CV_SPLITS=$BASE_DIR/splits.json
INPUT_DATA=$BASE_DIR/input_data.json
mkdir $MODEL_DIR
train \
--input-data $INPUT_DATA \
--stranded \
--has-control \
--output-dir $MODEL_DIR \
--reference-genome $REFERENCE_GENOME \
--chroms $(paste -s -d ' ' $REFERENCE_DIR/hg38_chroms.txt) \
--chrom-sizes $CHROM_SIZES \
--splits $CV_SPLITS \
--model-arch-name BPNet1000d8 \
--sequence-generator-name BPNet \
--model-output-filename model \
--input-seq-len 2114 \
--output-len 1000 \
--filters 64 \
--shuffle \
--threads 10 \
--epochs 100 \
--learning-rate 0.004
this training command gets instantiated as: train --input-data ENCSR000EGM/input_data.json --stranded --has-control --output-dir ENCSR000EGM/models --reference-genome reference/hg38.genome.fa --chroms chr1 chr2 --chrom-sizes reference/hg38.chrom.sizes --splits ENCSR000EGM/splits.json --model-arch-name BPNet1000d8 --sequence-generator-name BPNet --model-output-filename model --input-seq-len 2114 --output-len 1000 --filters 64 --shuffle --threads 10 --epochs 100 --learning-rate 0.004
source ~/conda_setup
conda env list
# conda environments:
#
base * /oak/stanford/scg/prj_ENCODE/miniconda3
basepairmodels /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels
conda activate basepairmodels
pip install pudb
pudb /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/train --input-data ENCSR000EGM/input_data.json --stranded --has-control --output-dir ENCSR000EGM/models --reference-genome reference/hg38.genome.fa --chroms chr1 chr2 --chrom-sizes reference/hg38.chrom.sizes --splits ENCSR000EGM/splits.json --model-arch-name BPNet1000d8 --sequence-generator-name BPNet --model-output-filename model --input-seq-len 2114 --output-len 1000 --filters 64 --shuffle --threads 10 --epochs 100 --learning-rate 0.004
I encountered an error message here and fixed it by adding one line in /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/train:
either line 7 (__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)") or line 8 (__spec__ = None)
I was able to dump the contents of __spec__ and assign it to a variable inside the __main__ guard, which allowed this code to run in the IPython console.
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
pudb /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/predict --model ENCSR000EGM/models/model_split000.h5 --chrom-sizes reference/hg38.chrom.sizes --chroms chr1 --reference-genome reference/hg38.genome.fa --exponentiate-counts --output-dir ENCSR000EGM/predictions --input-data ENCSR000EGM/input_data.json --predict-peaks --write-buffer-size 2000 --batch-size 1 --has-control --stranded --input-seq-len 2114 --output-len 1000 --output-window-size 1000
Use puDB to Debug Python Multiprocessing Code | by Auro Tripathy | Medium.pdf
You can easily do that with some conditional logic, shown below, to selectively break into a particular process:
if worker_id == 1:  # debug the process with id 1
    set_trace()
Once the condition is met in the process you want to break in, puDB will look for a free port and wait for a telnet connection:
pudb:6899: Please telnet into 127.0.0.1 6899.
pudb:6899: Waiting for client...
from multiprocessing import Process
from pudb.forked import set_trace

def f(name):
    # breakpoint() was introduced in Python 3.7
    # breakpoint()
    set_trace()  # more general
    print('hello', name)

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
Method 1. on terminal A: PYTHONBREAKPOINT=pudb.forked.set_trace python script.py
pudb:6899: Please telnet into 127.0.0.1 6899.
pudb:6899: Waiting for client...
on terminal B: telnet 127.0.0.1 6899 to enter the remote pudb session
Method 2. on terminal A: pudb script.py
Method 1 is more general: like the approach above, it can selectively break into one of multiple forked processes.
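Method 1 works because, from Python 3.7 onward, breakpoint() dispatches through the PYTHONBREAKPOINT environment variable: the named callable (here pudb.forked.set_trace) is imported and invoked in place of pdb. A small illustration of the mechanism (setting "0" disables the hook entirely):

```python
import os

# breakpoint() consults PYTHONBREAKPOINT on each call (Python >= 3.7).
# "0" turns it into a no-op; "pudb.forked.set_trace" (as in Method 1)
# would import and call that function instead of pdb.set_trace.
os.environ["PYTHONBREAKPOINT"] = "0"
breakpoint()  # no-op because of the setting above
print("execution continued past breakpoint()")
```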
srun --nodes=1 --ntasks=1 --cpus-per-task=10 --partition=interactive --account=default --time=4:00:00 --pty /bin/bash
source ~/conda_setup
conda activate basepairmodels
cd /home/taowang9/ChIP/basepairmodel/basepairmodels/tutorial
time /oak/stanford/scg/prj_ENCODE/miniconda3/envs/basepairmodels/bin/predict --model ENCSR000EGM/models/model_split000.h5 --chrom-sizes reference/hg38.chrom.sizes --chroms chr1 --reference-genome reference/hg38.genome.fa --exponentiate-counts --output-dir ENCSR000EGM/predictions --input-data ENCSR000EGM/input_data.json --predict-peaks --write-buffer-size 2000 --batch-size 1 --has-control --stranded --input-seq-len 2114 --output-len 1000 --output-window-size 1000
To compute metrics, first compute the min/max bounds for each peak individually, on both the plus and minus strands. This can be done with a single command as follows:
METRICS_DIR=$BASE_DIR/metrics
mkdir $METRICS_DIR
BOUNDS_DIR=$BASE_DIR/bounds
mkdir $BOUNDS_DIR
bounds \
--input-profiles $DATA_DIR/plus.bw $DATA_DIR/minus.bw \
--output-names task0_plus task0_minus \
--output-directory $BOUNDS_DIR \
--peaks $DATA_DIR/peaks.bed \
--chroms chr1
instantiated as:
bounds --input-profiles ENCSR000EGM/data/plus.bw ENCSR000EGM/data/minus.bw --output-names task0_plus task0_minus --output-directory ENCSR000EGM/bounds --peaks ENCSR000EGM/data/peaks.bed --chroms chr1
Now compute metrics on the plus and minus strands separately using the following command:
metrics \
-A [path to profile training bigwig] \
-B [path to profile predictions bigwig] \
--peaks $DATA_DIR/peaks.bed \
--chroms chr1 \
--output-dir $METRICS_DIR \
--apply-softmax-to-profileB \
--countsB [path to exponentiated counts predictions bigWig] \
--chrom-sizes $CHROM_SIZES
Instantiated for the plus strand and the minus strand separately:
metrics -A ENCSR000EGM/data/plus.bw -B ENCSR000EGM/predictions/model_split000_task0_plus.bw --peaks ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/metrics --apply-softmax-to-profileB --countsB ENCSR000EGM/predictions/model_split000_task0_plus_exponentiated_counts.bw --chrom-sizes reference/hg38.chrom.sizes
metrics -A ENCSR000EGM/data/minus.bw -B ENCSR000EGM/predictions/model_split000_task0_minus.bw --peaks ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/metrics --apply-softmax-to-profileB --countsB ENCSR000EGM/predictions/model_split000_task0_minus_exponentiated_counts.bw --chrom-sizes reference/hg38.chrom.sizes
SHAP_DIR=$BASE_DIR/shap
mkdir $SHAP_DIR
shap_scores \
    --reference-genome $REFERENCE_GENOME \
    --model $(ls ${MODEL_DIR}/INSERT-DIRECTORY-NAME-HERE/*.h5) \
    --bed-file $DATA_DIR/peaks.bed \
    --chroms chr1 \
    --output-dir $SHAP_DIR \
    --input-seq-len 2114 \
    --control-len 1000 \
    --control-smoothing 7.0 81 \
    --task-id 0 \
    --control-info $INPUT_DATA
Instantiated as:
shap_scores --reference-genome reference/hg38.genome.fa --model ENCSR000EGM/models/model_split000.h5 --bed-file ENCSR000EGM/data/peaks.bed --chroms chr1 --output-dir ENCSR000EGM/shap --input-seq-len 2114 --control-len 1000 --control-smoothing 7.0 81 --task-id 0 --control-info ENCSR000EGM/input_data.json
MODISCO_PROFILE_DIR=$BASE_DIR/modisco_profile
mkdir $MODISCO_PROFILE_DIR
motif_discovery \
--scores-path $INTERPRET_DIR/
MODISCO_COUNTS_DIR=$BASE_DIR/modisco_counts
mkdir $MODISCO_COUNTS_DIR
motif_discovery \
--scores-path $INTERPRET_DIR/
Instantiated as:
motif_discovery --scores-path ENCSR000EGM/shap/profile_scores.h5 --output-directory ENCSR000EGM/modisco_profile
motif_discovery --scores-path ENCSR000EGM/shap/counts_scores.h5 --output-directory ENCSR000EGM/modisco_counts
Continue exploration of two models
bpnet: https://github.com/kundajelab/bpnet
bpnet tutorial: https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD#scrollTo=k_tFgH1y-dq-
basepairmodels: https://github.com/kundajelab/basepairmodels
bpnet
ModuleNotFoundError in line 4 of /oak/stanford/scg/prj_ENCODE/ChIP/BPNet/bpnet/bpnet/examples/Snakefile: No module named 'bpnet'
File "/oak/stanford/scg/prj_ENCODE/ChIP/BPNet/bpnet/bpnet/examples/Snakefile", line 4, in <module>
basepairmodel