gimme scan running quite time

shangguandong1996 commented 3 years ago

Hi, dear developer

I am running gimme scan using my bed and genome, but it takes so much time. and it not finished.

$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p /opt/sysoft/Python-3.7.9/lib/python3.7/site-packages/data/motif_databases/JASPAR2018_plants.pfm test.bed > test.motif
2021-06-14 10:45:48,025 - INFO - determining FPR-based threshold

Here is my bed, which has only 1000 line

$ head test.bed 
Chr1    1422    3756    CIM_SIM_ATAC_1  .   .
Chr1    6494    6860    CIM_SIM_ATAC_2  .   .
Chr1    8527    8978    CIM_SIM_ATAC_3  .   .
Chr1    9381    9877    CIM_SIM_ATAC_4  .   .
Chr1    13375   15027   CIM_SIM_ATAC_5  .   .
Chr1    15874   16066   CIM_SIM_ATAC_6  .   .
Chr1    16542   16929   CIM_SIM_ATAC_7  .   .
Chr1    20621   21789   CIM_SIM_ATAC_8  .   .
Chr1    22150   23178   CIM_SIM_ATAC_9  .   .
Chr1    32994   34002   CIM_SIM_ATAC_10 .   .

Here is my genome, whose size about 100MB

$ head ~/reference/genome/TAIR10/Athaliana.fa
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCT
TGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTA
CATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTA
TGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATT
TAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAG
CATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAAAGGATTGGTGGTTTGAAGACACATCATATCAA
AAAAGCTATCGCCTCGACGATGCTCTATTTCTATCCTTGTAGCACACATTTTGGCACTCAAAAAAGTATTTTTAGATGT

But if I use the test data, it only takes less than 1 min

$ gimme scan Gm12878.CTCF.top500.w200.bed -g ~/reference/genome/hg19/hg19/hg19.fa > Gm12878.CTCF.top500.w200.motif

$ head Gm12878.CTCF.top500.w200.motif 
# GimmeMotifs version 0.16.0
# Input: Gm12878.CTCF.top500.w200.bed
# Motifs: /opt/sysoft/Python-3.7.9/lib/python3.7/site-packages/gimmemotifs/../data/motif_databases/gimme.vertebrate.v5.0.pfm
# FPR: 0.01 (/home/sgd/reference/genome/hg19/hg19/hg19.fa)
# Scoring: logodds score
chr11:190037-190237 pfmscan misc_feature    143 159 13.564392316467963  +   .   motif_name "GM.5.0.C2H2_ZF.0023" ; motif_instance "acgccccctgctggcag"
chr11:190037-190237 pfmscan misc_feature    143 156 11.987579910503003  +   .   motif_name "GM.5.0.C2H2_ZF.0024" ; motif_instance "acgccccctgctgg"
chr11:190037-190237 pfmscan misc_feature    28  40  7.964216719769734   -   .   motif_name "GM.5.0.C2H2_ZF.0028" ; motif_instance "cctgctggcaacc"
chr11:190037-190237 pfmscan misc_feature    144 156 13.411970806500104  +   .   motif_name "GM.5.0.C2H2_ZF.0037" ; motif_instance "cgccccctgctgg"
chr11:190037-190237 pfmscan misc_feature    144 156 8.375953426690378   -   .   motif_name "GM.5.0.Mixed.0025" ; motif_instance "cgccccctgctgg"

Best wishes

Guandong Shang

simonvh commented 3 years ago

Hi Guandong. We are aware that scanning may take some time. This is especially the case for longer sequences. We're working on improving this, but it will not be available in the short term, The test data has sequences of ~200bp, it seems your example has longer sequences. If you are analyzing ATAC-seq peaks (just guessing from your input), the best approach is to take 200bp regions centered at the summit of your peaks. This will be faster, but the motif results will also be much better.

shangguandong1996 commented 3 years ago

Thanks ! It works after resize my peak file. By the way, I highly recommend writing below sentence in manul :).

resizing peak file will make the scan works faster

shangguandong1996 commented 3 years ago

Hi, I just have another issue, which likes the below issue https://github.com/vanheeringen-lab/gimmemotifs/issues/152

Here is my whole peak set, I have resized 500bp

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head WT_CIM_SIM_Resize_MergePeak.bed 
Chr1    2875    3375    CIM_SIM_ATAC_resize_1   .   .
Chr1    6390    6890    CIM_SIM_ATAC_resize_2   .   .
Chr1    8490    8990    CIM_SIM_ATAC_resize_3   .   .
Chr1    9333    9833    CIM_SIM_ATAC_resize_4   .   .
Chr1    13981   14481   CIM_SIM_ATAC_resize_5   .   .
Chr1    15748   16248   CIM_SIM_ATAC_resize_6   .   .
Chr1    16494   16994   CIM_SIM_ATAC_resize_7   .   .
Chr1    20837   21337   CIM_SIM_ATAC_resize_8   .   .
Chr1    22684   23184   CIM_SIM_ATAC_resize_9   .   .
Chr1    33263   33763   CIM_SIM_ATAC_resize_10  .   .

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ wc -l WT_CIM_SIM_Resize_MergePeak.bed 
29804 WT_CIM_SIM_Resize_MergePeak.bed

Then I extract first 1000 line to do scan, which works well

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 1000 WT_CIM_SIM_Resize_MergePeak.bed > test.bed

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p ~/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme test.bed > test.motif

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head test.motif 
# GimmeMotifs version 0.16.0
# Input: test.bed
# Motifs: /home/sgd/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score
Chr1:2875-3375 CIM_SIM_ATAC_resize_1    pfmscan misc_feature    108 114 5.817207239414311   +   .   motif_name "MA0982.1_DOF2.4" ; motif_instance "AAAAAGT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1    pfmscan misc_feature    211 222 8.87118934508003    -   .   motif_name "MA1044.1_NAC92" ; motif_instance "TTTGGCGTGTTC"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1    pfmscan misc_feature    272 290 7.842209471388505   +   .   motif_name "MA1062.2_TCP15" ; motif_instance "TTGGGAGGGACCCATTATT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1    pfmscan misc_feature    272 290 7.86289776912883    +   .   motif_name "MA1065.2_TCP20" ; motif_instance "TTGGGAGGGACCCATTATT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1    pfmscan misc_feature    110 119 8.55527391791114    +   .   motif_name "MA1089.1_WRKY57" ; motif_instance "AAAGTCAACC"

But If I use the whole peak set, it do not work well

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p ~/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme WT_CIM_SIM_Resize_MergePeak.bed > WT_CIM_SIM_Resize_MergePeak.motif

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ cat WT_CIM_SIM_Resize_MergePeak.motif 
# GimmeMotifs version 0.16.0
# Input: WT_CIM_SIM_Resize_MergePeak.bed
# Motifs: /home/sgd/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score

simonvh commented 3 years ago

How big is the peak set?

shangguandong1996 commented 3 years ago

Hi, Simon. My total line is

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ wc -l WT_CIM_SIM_Resize_MergePeak.bed 
29804 WT_CIM_SIM_Resize_MergePeak.bed

But I find some weird things. If I extract 20000 line, it can output motif but sometimes will stop in some location

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 20000 WT_CIM_SIM_Resize_MergePeak.bed > test.bed 

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p JASPAR2018_plants test.bed -b -N 50

# and it will stop in these line when outputing motif
Chr1    11013665    11013673    MA0930.1_ABF3   6.513131829858493   +
Chr1    11013747    11013761    MA1012.1_AGL27  13.35310311316117   -
Chr1    11013670    11013678    MA1057.1_SPL12  7.5510905528271115  +
Chr1    11019046    11019054    MA1042.1_MYB59  8.971458346922029   -

And if I extract 20001 line, it can not work

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 20001 WT_CIM_SIM_Resize_MergePeak.bed > test.bed 

sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p JASPAR2018_plants test.bed -b -N 50
# GimmeMotifs version 0.16.0
# Input: test.bed
# Motifs: JASPAR2018_plants
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score

I am wondering whether you can test my bed file.

It seems that github do not allow update bed file, so I rename it into txt WT_CIM_SIM_Resize_MergePeak.txt

JASPAR2018_plants is in gimmemotifs database

you can download TAIR10 genome in https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas but you have to manually modify chrosome name like these

sgd@localhost ~/reference/genome/TAIR10
$ grep ">" Athaliana.fa
>Chr1
>Chr2
>Chr3
>Chr4
>Chr5
>ChrM
>ChrC

simonvh commented 3 years ago

Hi @shangguandong1996 , finally getting back to you. It turns out that the TAIR10 genome contains non-ACTG characters. This caused your issue (the scanning fails, but never reports back and the program hangs). Can you try the fix?

Run the following in your environment and try again:

pip install git+https://github.com/vanheeringen-lab/gimmemotifs.git@763bf23

shangguandong1996 commented 3 years ago

Thanks @simonvh It works :)

vanheeringen-lab / gimmemotifs

gimme scan running quite time #193