Closed shangguandong1996 closed 1 year ago
Hi Guandong. We are aware that scanning may take some time, especially for longer sequences. We're working on improving this, but it will not be available in the short term. The test data has sequences of ~200bp; it seems your example has longer sequences. If you are analyzing ATAC-seq peaks (just guessing from your input), the best approach is to take 200bp regions centered at the summit of your peaks. This will be faster, and the motif results will also be much better.
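For readers who want to try the resizing suggested above, here is a minimal shell sketch. It assumes the BED file has no summit column, so each interval's midpoint stands in for the summit. `resize_bed` and the file names in the usage comment are hypothetical and not part of GimmeMotifs. If your peak caller reports a summit position, use that instead of the midpoint.

```shell
# Resize every BED interval to a fixed width centered on its midpoint.
# Assumption: the midpoint approximates the peak summit.
# Reads BED on stdin, writes tab-separated BED on stdout.
resize_bed() {
    # $1 = total width in bp (e.g. 200)
    awk -v width="$1" 'BEGIN { OFS = "\t" }
    NF >= 3 {                               # skip lines with fewer than three fields
        mid   = int(($2 + $3) / 2)          # midpoint of the original interval
        start = mid - int(width / 2)
        if (start < 0) start = 0            # clamp at the chromosome start
        $2 = start
        $3 = start + width                  # not clamped at the chromosome end
        print
    }'
}

# Usage (hypothetical file names):
#   resize_bed 200 < peaks.bed > peaks_200bp.bed
```

The end coordinate is not checked against the chromosome length, so intervals close to a chromosome end may need trimming.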
Thanks! It works after resizing my peak file. By the way, I highly recommend adding the sentence below to the manual :).
Resizing the peak file will make the scan run faster.
Hi, I have another issue, which looks similar to this one: https://github.com/vanheeringen-lab/gimmemotifs/issues/152
Here is my whole peak set, which I have resized to 500bp:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head WT_CIM_SIM_Resize_MergePeak.bed
Chr1 2875 3375 CIM_SIM_ATAC_resize_1 . .
Chr1 6390 6890 CIM_SIM_ATAC_resize_2 . .
Chr1 8490 8990 CIM_SIM_ATAC_resize_3 . .
Chr1 9333 9833 CIM_SIM_ATAC_resize_4 . .
Chr1 13981 14481 CIM_SIM_ATAC_resize_5 . .
Chr1 15748 16248 CIM_SIM_ATAC_resize_6 . .
Chr1 16494 16994 CIM_SIM_ATAC_resize_7 . .
Chr1 20837 21337 CIM_SIM_ATAC_resize_8 . .
Chr1 22684 23184 CIM_SIM_ATAC_resize_9 . .
Chr1 33263 33763 CIM_SIM_ATAC_resize_10 . .
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ wc -l WT_CIM_SIM_Resize_MergePeak.bed
29804 WT_CIM_SIM_Resize_MergePeak.bed
Then I extracted the first 1000 lines and scanned them, which works well:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 1000 WT_CIM_SIM_Resize_MergePeak.bed > test.bed
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p ~/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme test.bed > test.motif
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head test.motif
# GimmeMotifs version 0.16.0
# Input: test.bed
# Motifs: /home/sgd/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score
Chr1:2875-3375 CIM_SIM_ATAC_resize_1 pfmscan misc_feature 108 114 5.817207239414311 + . motif_name "MA0982.1_DOF2.4" ; motif_instance "AAAAAGT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1 pfmscan misc_feature 211 222 8.87118934508003 - . motif_name "MA1044.1_NAC92" ; motif_instance "TTTGGCGTGTTC"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1 pfmscan misc_feature 272 290 7.842209471388505 + . motif_name "MA1062.2_TCP15" ; motif_instance "TTGGGAGGGACCCATTATT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1 pfmscan misc_feature 272 290 7.86289776912883 + . motif_name "MA1065.2_TCP20" ; motif_instance "TTGGGAGGGACCCATTATT"
Chr1:2875-3375 CIM_SIM_ATAC_resize_1 pfmscan misc_feature 110 119 8.55527391791114 + . motif_name "MA1089.1_WRKY57" ; motif_instance "AAAGTCAACC"
But if I use the whole peak set, it does not work. The output contains only the header:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p ~/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme WT_CIM_SIM_Resize_MergePeak.bed > WT_CIM_SIM_Resize_MergePeak.motif
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ cat WT_CIM_SIM_Resize_MergePeak.motif
# GimmeMotifs version 0.16.0
# Input: WT_CIM_SIM_Resize_MergePeak.bed
# Motifs: /home/sgd/reference/annoation/Athaliana/motif/JASPAR2020_joined_motifs.meme
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score
How big is the peak set?
Hi Simon. The total number of lines is:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ wc -l WT_CIM_SIM_Resize_MergePeak.bed
29804 WT_CIM_SIM_Resize_MergePeak.bed
But I found something weird. If I extract the first 20000 lines, it outputs motifs but sometimes stops at a certain location:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 20000 WT_CIM_SIM_Resize_MergePeak.bed > test.bed
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p JASPAR2018_plants test.bed -b -N 50
# and it stops at these lines while outputting motifs
Chr1 11013665 11013673 MA0930.1_ABF3 6.513131829858493 +
Chr1 11013747 11013761 MA1012.1_AGL27 13.35310311316117 -
Chr1 11013670 11013678 MA1057.1_SPL12 7.5510905528271115 +
Chr1 11019046 11019054 MA1042.1_MYB59 8.971458346922029 -
And if I extract the first 20001 lines, it does not work at all:
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ head -n 20001 WT_CIM_SIM_Resize_MergePeak.bed > test.bed
sgd@localhost ~/project/202005/WLY_Total_Fig_202005/ATAC_CIM_SIM/result/gimmeMotif
$ gimme scan -g ~/reference/genome/TAIR10/Athaliana.fa -p JASPAR2018_plants test.bed -b -N 50
# GimmeMotifs version 0.16.0
# Input: test.bed
# Motifs: JASPAR2018_plants
# FPR: 0.01 (/home/sgd/reference/genome/TAIR10/Athaliana.fa)
# Scoring: logodds score
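Since 20000 lines work and 20001 lines do not, it can help to look at the boundary interval on its own. The sketch below is only an illustration. `bed_lines` is a hypothetical helper that prints a range of lines from a file, so that a single suspect interval can be inspected or scanned separately.

```shell
# Print only lines START..END of a file, then stop reading.
bed_lines() {
    # $1 = first line, $2 = last line, $3 = BED file
    sed -n "${1},${2}p;${2}q" "$3"
}

# Usage with the file from this thread:
#   bed_lines 20001 20001 WT_CIM_SIM_Resize_MergePeak.bed > suspect.bed
```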
I am wondering whether you could test my BED file.
It seems that GitHub does not allow uploading BED files, so I renamed it to .txt: WT_CIM_SIM_Resize_MergePeak.txt
JASPAR2018_plants is included in the GimmeMotifs database.
You can download the TAIR10 genome from https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas but you have to manually rename the chromosomes like this:
sgd@localhost ~/reference/genome/TAIR10
$ grep ">" Athaliana.fa
>Chr1
>Chr2
>Chr3
>Chr4
>Chr5
>ChrM
>ChrC
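The renaming can be scripted rather than done by hand. The sketch below is hedged: it assumes the downloaded FASTA headers start with `>1` to `>5`, `>mitochondria`, and `>chloroplast`, which is not stated in this thread. Check the real headers with grep ">" first and adjust the patterns if they differ. `rename_tair10_headers` is a hypothetical helper name.

```shell
# Rewrite FASTA headers to Chr1..Chr5, ChrM, ChrC.
# Assumption: original headers begin with >1..>5, >mitochondria, >chloroplast.
# Reads FASTA on stdin, writes FASTA on stdout; sequence lines pass through unchanged.
rename_tair10_headers() {
    sed -e 's/^>\([1-5]\).*/>Chr\1/' \
        -e 's/^>mitochondria.*/>ChrM/' \
        -e 's/^>chloroplast.*/>ChrC/'
}

# Usage:
#   rename_tair10_headers < TAIR10_chr_all.fas > Athaliana.fa
```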
Hi @shangguandong1996, finally getting back to you. It turns out that the TAIR10 genome contains non-ACTG characters. This caused your issue: the scanning fails but never reports back, so the program hangs. Can you try the fix?
Run the following in your environment and try again:
pip install git+https://github.com/vanheeringen-lab/gimmemotifs.git@763bf23
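To see which characters are involved in a given genome, one option is to tally every sequence character that is not A, C, G, or T. This is a rough sketch and not the check GimmeMotifs itself performs. `count_non_acgt` is a hypothetical helper name. N is counted as well, so the reader has to decide which characters matter.

```shell
# Tally sequence characters other than A/C/G/T (either case) in a FASTA stream.
# Prints one "<character><TAB><count>" line per distinct character found;
# prints nothing if the sequences contain only A, C, G, and T.
count_non_acgt() {
    grep -v '^>' |                 # drop header lines
        grep -o '[^ACGTacgt]' |    # emit each non-ACGT character on its own line
        sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Usage with the genome path from this thread:
#   count_non_acgt < ~/reference/genome/TAIR10/Athaliana.fa
```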
Thanks @simonvh, it works :)
Hi, dear developer
I am running gimme scan with my BED file and genome, but it takes a very long time and does not finish.
Here is my BED file, which has only 1000 lines.
Here is my genome, which is about 100MB in size.
But if I use the test data, it takes less than 1 minute.
Best wishes
Guandong Shang