loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License
191 stars 41 forks source link

BINDetect not reading motif files correctly #78

Closed connorrogerson closed 3 years ago

connorrogerson commented 3 years ago

Hi team,

I'm recently re-running some analyses and BINDetect is not reading the motif file correctly.

My original run was using the JASPAR2020 motif database. If we head the file:

MA0006.1_Ahr::Arnt 0.1250 0.3333 0.0834 0.4583 0.0000 0.0000 0.9582 0.0417 0.0000 0.9582 0.0000 0.0417 0.0000 0.0000 0.9582 0.0417 0.0000 0.0000 0.0000 0.9999 0.0000 0.0000 0.9999 0.0000 MA0854.1_Alx1 0.1900 0.4400 0.1700 0.2000 0.0700 0.2100 0.6500 0.0700 0.3800 0.3300 0.1400 0.1500 0.4300 0.0900 0.3100 0.1700 0.0500 0.4000 0.0200 0.5300 0.0100 0.0700 0.0000 0.9200 0.9899 0.0000 0.0101 0.0000 0.9800 0.0000 0.0100 0.0100 0.0100 0.0100 0.0000 0.9800 0.0000 0.0101 0.0000 0.9899 0.9200 0.0000 0.0700 0.0100 0.5300 0.0200 0.4000 0.0500 0.1500 0.2100 0.1100 0.5300 0.2300 0.3200 0.1600 0.2900 0.3939 0.3131 0.1212 0.1717 0.2400 0.2900 0.2300 0.2400 0.1500 0.4000 0.1400 0.3100

However, I've been trying to get TOBIAS to read the motif database that comes with Gimmemotifs (https://github.com/vanheeringen-lab/gimmemotifs), just for consistency. If we head the file:

GM.5.0.Sox.0001 0.7213 0.0793 0.1103 0.0891 0.9259 0.0072 0.0062 0.0607 0.0048 0.9203 0.0077 0.0672 0.9859 0.0030 0.0030 0.0081 0.9778 0.0043 0.0128 0.0051 0.1484 0.0050 0.0168 0.8299 GM.5.0.Homeodomain.0001 0.8870 0.0000 0.0178 0.0951 0.1156 0.2033 0.6629 0.0181 0.0017 0.7452 0.0809 0.1722 0.0011 0.0003 0.0003 0.9983 0.0026 0.0141 0.9721 0.0111 0.0000 0.0189 0.0054 0.9758 0.0006 0.9983 0.0006 0.0006 0.9170 0.0140 0.0046 0.0644 0.2228 0.2421 0.3300 0.2051 0.3621 0.1054 0.2208 0.3116 0.5727 0.0104 0.1741 0.2428

GM.5.0.Mixed.0001 EGR1_disc6 EGR1_GM12878_encode-Myers_seq_hsa_v041610.1_r1:AlignACE#2#Intergenic;;SRF_disc2 SRF_HepG2_encode-Myers_seq_hsa_v041610.1_r1:AlignACE#3#Intergenic

GM.5.0.Mixed.0001 0.0061 0.3310 0.6435 0.0195 0.1550 0.3038 0.3952 0.1460 0.1556 0.2468 0.4775 0.1201 0.1152 0.2145 0.6121 0.0581 0.0050 0.3077 0.6282 0.0591 0.0027 0.4947 0.4958 0.0069 0.0994 0.4565 0.3690 0.0751 0.1759 0.3192 0.4488 0.0561 0.1084 0.2560 0.5840 0.0516 0.0031 0.1431 0.8534 0.0004 0.0168 0.5105 0.4666 0.0061

When I try and run this, I get the errror:

# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.pfm --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/ --cores 4

# ----- Input parameters -----
# signals:  ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:    /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:   ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.pfm']
# genome:   /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:   ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:  None
# naming:   name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo:   None
# time_series:  False
# skip_excel:   False
# output_peaks: None
# prefix:   bindetect
# outdir:   /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:    4
# split:    100
# verbosity:    3

# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf

2021-06-24 10:12:53 (76618) [INFO]  ----- Processing input data -----
2021-06-24 10:12:53 (76618) [INFO]  Checking reading/writing of files
2021-06-24 10:12:53 (76618) [INFO]  Reading peaks
2021-06-24 10:12:54 (76618) [INFO]  - Found 1213 regions in input peaks
2021-06-24 10:12:54 (76618) [INFO]  - Merged to 1213 regions
2021-06-24 10:12:54 (76618) [INFO]  Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 10:12:54 (76618) [INFO]  - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 10:12:54 (76618) [INFO]  - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 10:12:54 (76618) [INFO]  - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 10:12:54 (76618) [INFO]  Estimating GC content from peak sequences
2021-06-24 10:12:57 (76618) [INFO]  - GC content estimated at 51.22%
2021-06-24 10:12:57 (76618) [INFO]  Reading motifs from file
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/seq/bin/TOBIAS", line 33, in <module>
    sys.exit(load_entry_point('tobias==0.12.10', 'console_scripts', 'TOBIAS')())
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/TOBIAS.py", line 154, in main
    args.func(args)     
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/tools/bindetect.py", line 223, in run_bindetect
    motif_list += MotifList().from_file(f)  #List of OneMotif objects
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/utils/motifs.py", line 241, in from_file
    for m in motifs.parse(f, file_format):
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/__init__.py", line 116, in parse
    return jaspar.read(handle, fmt)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 164, in read
    record = _read_jaspar(handle)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 311, in _read_jaspar
    counts[nucleotides[row_count]] = [float(x) for x in words]
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 311, in <listcomp>
    counts[nucleotides[row_count]] = [float(x) for x in words]
ValueError: could not convert string to float: '#GM.5.0.Mixed.0001'

I edited the motif file to remove # items. Head of the edited motif file:

GM.5.0.Sox.0001 0.7213 0.0793 0.1103 0.0891 0.9259 0.0072 0.0062 0.0607 0.0048 0.9203 0.0077 0.0672 0.9859 0.0030 0.0030 0.0081 0.9778 0.0043 0.0128 0.0051 0.1484 0.0050 0.0168 0.8299 GM.5.0.Homeodomain.0001 0.8870 0.0000 0.0178 0.0951 0.1156 0.2033 0.6629 0.0181 0.0017 0.7452 0.0809 0.1722 0.0011 0.0003 0.0003 0.9983 0.0026 0.0141 0.9721 0.0111 0.0000 0.0189 0.0054 0.9758 0.0006 0.9983 0.0006 0.0006 0.9170 0.0140 0.0046 0.0644 0.2228 0.2421 0.3300 0.2051 0.3621 0.1054 0.2208 0.3116 0.5727 0.0104 0.1741 0.2428 GM.5.0.Mixed.0001 0.0061 0.3310 0.6435 0.0195 0.1550 0.3038 0.3952 0.1460 0.1556 0.2468 0.4775 0.1201 0.1152 0.2145 0.6121 0.0581 0.0050 0.3077 0.6282 0.0591 0.0027 0.4947 0.4958 0.0069 0.0994 0.4565 0.3690 0.0751 0.1759 0.3192 0.4488 0.0561 0.1084 0.2560 0.5840 0.0516 0.0031 0.1431 0.8534 0.0004 0.0168 0.5105 0.4666 0.0061

However when I try and run BINDetect, I get this error:

# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.tobias.pfm --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/ --cores 4

# ----- Input parameters -----
# signals:      ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:        /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:       ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.tobias.pfm']
# genome:       /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:   ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:  None
# naming:       name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo:       None
# time_series:  False
# skip_excel:   False
# output_peaks: None
# prefix:       bindetect
# outdir:       /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:        4
# split:        100
# verbosity:    3

# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf

2021-06-24 11:29:49 (171212) [INFO]     ----- Processing input data -----
2021-06-24 11:29:49 (171212) [INFO]     Checking reading/writing of files
2021-06-24 11:29:49 (171212) [INFO]     Reading peaks
2021-06-24 11:29:49 (171212) [INFO]     - Found 1213 regions in input peaks
2021-06-24 11:29:49 (171212) [INFO]     - Merged to 1213 regions
2021-06-24 11:29:49 (171212) [INFO]     Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 11:29:49 (171212) [INFO]     Estimating GC content from peak sequences
2021-06-24 11:29:51 (171212) [INFO]     - GC content estimated at 51.22%
2021-06-24 11:29:51 (171212) [INFO]     Reading motifs from file
2021-06-24 11:29:52 (171212) [INFO]     - Read 5531 motifs
2021-06-24 11:29:52 (171212) [WARNING]  The motif output names (as given by --naming) are not unique.
2021-06-24 11:29:52 (171212) [WARNING]  The following names occur more than once: ['_']
2021-06-24 11:29:52 (171212) [WARNING]  These motifs will be renamed with '_1', '_2' etc. To prevent this renaming, please make the names of the input --motifs unique
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Somehow it reads over 5000 motifs? Any ideas would be appreciated!

connorrogerson commented 3 years ago

Tried it with a meme format motif database and receive this error:

TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

# TOBIAS 0.12.1 BINDetect (run started 2021-06-24 13:10:49.881592)
# Working directory: /rds/user/cjr78/hpc-work/ATAC/tobias
# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

# ----- Input parameters -----
# signals:      ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:        /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:       ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme']
# genome:       /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:   ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:  None
# naming:       name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo:       None
# time_series:  False
# skip_excel:   False
# output_peaks: None
# prefix:       bindetect
# outdir:       /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:        1
# split:        100
# verbosity:    3

# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf

2021-06-24 13:10:49 (213198) [INFO]     ----- Processing input data -----
2021-06-24 13:10:49 (213198) [INFO]     Checking reading/writing of files
2021-06-24 13:10:50 (213198) [INFO]     Reading peaks
2021-06-24 13:10:50 (213198) [INFO]     - Found 1213 regions in input peaks
2021-06-24 13:10:50 (213198) [INFO]     - Merged to 1213 regions
2021-06-24 13:10:50 (213198) [INFO]     Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 13:10:50 (213198) [INFO]     Estimating GC content from peak sequences
2021-06-24 13:10:57 (213198) [INFO]     - GC content estimated at 51.22%
2021-06-24 13:10:57 (213198) [INFO]     Reading motifs from file
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/seq/bin/TOBIAS", line 33, in <module>
    sys.exit(load_entry_point('tobias==0.12.1', 'console_scripts', 'TOBIAS')())
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/TOBIAS.py", line 154, in main
    args.func(args)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/tools/bindetect.py", line 212, in run_bindetect
    motif_list += MotifList().from_file(f)  #List of OneMotif objects
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/utils/motifs.py", line 201, in from_file
    self[-1].n = int(info["nsites"]) #overwrites OneMotif default
ValueError: invalid literal for int() with base 10: '1000.0'
msbentsen commented 3 years ago

Hi, thank you for this issue! The first errors with gimme.vertebrate.v5.0.tobias.pfm arise because BINDetect expects either MEME or JASPAR-style motif file. I am not sure what this format is:

MA0006.1_Ahr::Arnt
0.1250 0.3333 0.0834 0.4583
0.0000 0.0000 0.9582 0.0417
0.0000 0.9582 0.0000 0.0417
0.0000 0.0000 0.9582 0.0417
0.0000 0.0000 0.0000 0.9999
0.0000 0.0000 0.9999 0.0000

I would love to add the option to read more formats, but I haven't been able to find a good source for the file format convention. Sometimes .pfm looks like the above, sometimes it looks like:

SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234

So that is unfortunately a bit hard....

But for the error with the .meme file, this is definitely a bug from my side :-) In the letter-probability matrix-line, you probably have a float number nsites: letter-probability matrix: nsites= 1000.0 vs. letter-probability matrix: nsites= 1000

And the error arises because this can't be directly transformed into int. So changing nsites to an integer will also clear the error. I have made a small fix on the development branch (https://github.com/loosolab/TOBIAS/commit/e5251cc61aa5eec184558e51ea96c5a753f0b943), so you might also fetch and install that. I like to collect a few bugfixes before I release a new version, so I can't say when it will be officially 'out'. But I hope this helps you out!

Mette

connorrogerson commented 3 years ago

Hi I'm getting another error unfortunately.

This is my meme file:

MEME version 3.0

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891 0.9259 0.0072 0.0062 0.0607 0.0048 0.9203 0.0077 0.0672 0.9859 0.003 0.003 0.0081 0.9778 0.0043 0.0128 0.0051 0.1484 0.005 0.0168 0.8299 MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951 0.1156 0.2033 0.6629 0.0181 0.0017 0.7452 0.0809 0.1722 0.0011 0.0003 0.0003 0.9983 0.0026 0.0141 0.9721 0.0111 0 0.0189 0.0054 0.9758 0.0006 0.9983 0.0006 0.0006 0.917 0.014 0.0046 0.0644 0.2228 0.2421 0.33 0.2051 0.3621 0.1054 0.2208 0.3116 0.5727 0.0104 0.1741 0.2428 MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195 0.155 0.3038 0.3952 0.146 0.1556 0.2468 0.4775 0.1201 0.1152 0.2145 0.6121 0.0581 0.005 0.3077 0.6282 0.0591 0.0027 0.4947 0.4958 0.0069 0.0994 0.4565 0.369 0.0751 0.1759 0.3192 0.4488 0.0561 0.1084 0.256 0.584 0.0516 0.0031 0.1431 0.8534 0.0004 0.0168 0.5105 0.4666 0.0061

The run output is: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

TOBIAS 0.12.12 BINDetect (run started 2021-06-25 16:50:29.791082)

Working directory: /rds/user/cjr78/hpc-work/ATAC/tobias

Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

----- Input parameters -----

signals: ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']

peaks: /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed

motifs: ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme']

genome: /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa

cond_names: ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']

peak_header: None

naming: name_id

motif_pvalue: 0.0001

bound_pvalue: 0.001

pseudo: None

time_series: False

skip_excel: False

output_peaks: None

prefix: bindetect

outdir: /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated

cores: 1

split: 100

verbosity: 3

----- Output files -----

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_7_8_9_corrected_bound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_7_8_9_corrected_unbound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_1_2_3_corrected_bound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_1_2_3_corrected_unbound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_all.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//plots/_log2fcs.pdf

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//_overview.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//_overview.xlsx

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf

2021-06-25 16:50:29 (127630) [INFO] ----- Processing input data ----- 2021-06-25 16:50:29 (127630) [INFO] Checking reading/writing of files 2021-06-25 16:50:29 (127630) [INFO] Reading peaks 2021-06-25 16:50:29 (127630) [INFO] - Found 1213 regions in input peaks 2021-06-25 16:50:29 (127630) [INFO] - Merged to 1213 regions 2021-06-25 16:50:29 (127630) [INFO] Checking for match between --peaks and --fasta/--signals boundaries 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw 2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw 2021-06-25 16:50:29 (127630) [INFO] Estimating GC content from peak sequences 2021-06-25 16:50:30 (127630) [INFO] - GC content estimated at 51.22% 2021-06-25 16:50:30 (127630) [INFO] Reading motifs from file ERROR: No matrix could be read for motif 'GM.5.0.Sox.0001 ' - please check the format of the input motif file.

msbentsen commented 3 years ago

Hi Connor,

This was a small oversight from my end - I falsely assumed there would always be at least one line between two motifs, e.g.:

MEME version 3.0

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.003 0.003 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.005 0.0168 0.8299

MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.917 0.014 0.0046 0.0644
0.2228 0.2421 0.33 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428

MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195
0.155 0.3038 0.3952 0.146
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.005 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.369 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.256 0.584 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061

The above motifs work with version 0.12.11, and the file without spaces works with the current 0.12.12 (in progress, might include more fixes in the future) on the dev branch. Thanks for letting me know of these issues!