Closed connorrogerson closed 3 years ago
Tried it with a meme format motif database and receive this error:
TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/
# TOBIAS 0.12.1 BINDetect (run started 2021-06-24 13:10:49.881592)
# Working directory: /rds/user/cjr78/hpc-work/ATAC/tobias
# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/
# ----- Input parameters -----
# signals: ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks: /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs: ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme']
# genome: /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names: ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header: None
# naming: name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo: None
# time_series: False
# skip_excel: False
# output_peaks: None
# prefix: bindetect
# outdir: /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores: 1
# split: 100
# verbosity: 3
# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf
2021-06-24 13:10:49 (213198) [INFO] ----- Processing input data -----
2021-06-24 13:10:49 (213198) [INFO] Checking reading/writing of files
2021-06-24 13:10:50 (213198) [INFO] Reading peaks
2021-06-24 13:10:50 (213198) [INFO] - Found 1213 regions in input peaks
2021-06-24 13:10:50 (213198) [INFO] - Merged to 1213 regions
2021-06-24 13:10:50 (213198) [INFO] Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 13:10:50 (213198) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 13:10:50 (213198) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 13:10:50 (213198) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 13:10:50 (213198) [INFO] Estimating GC content from peak sequences
2021-06-24 13:10:57 (213198) [INFO] - GC content estimated at 51.22%
2021-06-24 13:10:57 (213198) [INFO] Reading motifs from file
Traceback (most recent call last):
File "/home/cjr78/miniconda3/envs/seq/bin/TOBIAS", line 33, in <module>
sys.exit(load_entry_point('tobias==0.12.1', 'console_scripts', 'TOBIAS')())
File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/TOBIAS.py", line 154, in main
args.func(args)
File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/tools/bindetect.py", line 212, in run_bindetect
motif_list += MotifList().from_file(f) #List of OneMotif objects
File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/utils/motifs.py", line 201, in from_file
self[-1].n = int(info["nsites"]) #overwrites OneMotif default
ValueError: invalid literal for int() with base 10: '1000.0'
Hi, thank you for this issue! The first errors with gimme.vertebrate.v5.0.tobias.pfm
arise because BINDetect expects either MEME or JASPAR-style motif file. I am not sure what this format is:
MA0006.1_Ahr::Arnt
0.1250 0.3333 0.0834 0.4583
0.0000 0.0000 0.9582 0.0417
0.0000 0.9582 0.0000 0.0417
0.0000 0.0000 0.9582 0.0417
0.0000 0.0000 0.0000 0.9999
0.0000 0.0000 0.9999 0.0000
I would love to add the option to read more formats, but I haven't been able to find a good source for the file format convention. Sometimes .pfm
looks like the above, sometimes it looks like:
SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234
So that is unfortunately a bit hard....
But for the error with the .meme file, this is definitely a bug from my side :-) In the letter-probability matrix-line, you probably have a float number nsites:
letter-probability matrix: nsites= 1000.0
vs. letter-probability matrix: nsites= 1000
And the error arises because this can't be directly transformed into int. So changing nsites to an integer will also clear the error. I have made a small fix on the development branch (https://github.com/loosolab/TOBIAS/commit/e5251cc61aa5eec184558e51ea96c5a753f0b943), so you might also fetch and install that. I like to collect a few bugfixes before I release a new version, so I can't say when it will be officially 'out'. But I hope this helps you out!
Mette
Hi I'm getting another error unfortunately.
This is my meme file:
MEME version 3.0
ALPHABET= ACGT
strands: + -
Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.003 0.003 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.005 0.0168 0.8299
MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.917 0.014 0.0046 0.0644
0.2228 0.2421 0.33 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428
MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195
0.155 0.3038 0.3952 0.146
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.005 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.369 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.256 0.584 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061
The run output is: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/
2021-06-25 16:50:29 (127630) [INFO] ----- Processing input data ----- 2021-06-25 16:50:29 (127630) [INFO] Checking reading/writing of files 2021-06-25 16:50:29 (127630) [INFO] Reading peaks 2021-06-25 16:50:29 (127630) [INFO] - Found 1213 regions in input peaks 2021-06-25 16:50:29 (127630) [INFO] - Merged to 1213 regions 2021-06-25 16:50:29 (127630) [INFO] Checking for match between --peaks and --fasta/--signals boundaries 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw 2021-06-25 16:50:29 (127630) [INFO] Estimating GC content from peak sequences 2021-06-25 16:50:30 (127630) [INFO] - GC content estimated at 51.22% 2021-06-25 16:50:30 (127630) [INFO] Reading motifs from file ERROR: No matrix could be read for motif 'GM.5.0.Sox.0001 ' - please check the format of the input motif file.
Hi Connor,
This was a small oversight from my end - I falsely assumed there would always be at least one line between two motifs, e.g.:
MEME version 3.0
ALPHABET= ACGT
strands: + -
Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.003 0.003 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.005 0.0168 0.8299
MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.917 0.014 0.0046 0.0644
0.2228 0.2421 0.33 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428
MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195
0.155 0.3038 0.3952 0.146
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.005 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.369 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.256 0.584 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061
The above motifs work with version 0.12.11
, and the file without spaces works with the current 0.12.12
(in progress, might include more fixes in the future) on the dev branch. Thanks for letting me know of these issues!
Hi team,
I'm recently re-running some analyses and BINDetect is not reading the motif file correctly.
My original run was using the JASPAR2020 motif database. If we head the file:
However, I've been trying to get TOBIAS to read the motif database that comes with Gimmemotifs (https://github.com/vanheeringen-lab/gimmemotifs), just for consistency. If we head the file:
When I try and run this, I get the errror:
I edited the motif file to remove # items. Head of the edited motif file:
However when I try and run BINDetect, I get this error:
Somehow it reads over 5000 motifs? Any ideas would be appreciated!