vanheeringen-lab / gimmemotifs

Suite of motif tools, including a motif prediction pipeline for ChIP-seq experiments. See full GimmeMotifs documentation for detailed installation instructions and usage examples.
https://gimmemotifs.readthedocs.io/en/master
MIT License

Gimme motifs can't read intermediate file #167

Closed connorrogerson closed 3 years ago

connorrogerson commented 3 years ago

Describe the bug When running gimme motifs with the following parameters:

    gimme motifs -s 0 -f 0.5 -g mm10 --denovo Alluvial_open_GM.5.0.Forkhead.0008.bed Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs

We get the following error:

    2021-01-13 17:29:57,285 - INFO - creating background (matched GC%)
    Sequences do not seem to be of equal size. GC% matched sequences of the median size (300) will be created
    2021-01-13 17:30:15,741 - INFO - starting full motif analysis
    2021-01-13 17:30:15,741 - INFO - using original size
    2021-01-13 17:30:15,741 - INFO - preparing input from BED
    Please provide input file in BED or FASTA format
    Traceback (most recent call last):
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/background.py", line 468, in matched_gc_bedfile
        for seq in fa.seqs
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/background.py", line 468, in <listcomp>
        for seq in fa.seqs
    ZeroDivisionError: division by zero

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/cjr78/miniconda3/envs/gimme/bin/gimme", line 11, in <module>
        cli(sys.argv[1:])
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/cli.py", line 625, in cli
        args.func(args)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/commands/motifs.py", line 94, in motifs
        "size": args.size,
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/denovo.py", line 609, in gimme_motifs
        params.get("custom_background", None),
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/denovo.py", line 316, in create_backgrounds
        custom_background=custom_background,
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/denovo.py", line 226, in create_background
        f = MatchedGcFasta(fafile, genome, nr_times * len(fg))
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/background.py", line 559, in __init__
        matched_gc_bedfile(tmpbed, matchfile, genome, number, size=size)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/background.py", line 477, in matched_gc_bedfile
        [float(x[fields + 1]) for x in bed.nucleotide_content(fi=genome_fa)]
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/pybedtools/bedtool.py", line 917, in decorated
        result = method(self, *args, **kwargs)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/pybedtools/bedtool.py", line 401, in wrapped
        decode_output=decode_output,
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/pybedtools/helpers.py", line 455, in call_bedtools
        raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
    pybedtools.helpers.BEDToolsError: Command was:

    bedtools nuc -fi /home/cjr78/.local/share/genomes/mm10/mm10.fa -bed Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/prediction.fa

Error message was: It looks as though you have less than 3 columns at line 1 in file Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/prediction.fa. Are you sure your files are tab-delimited?
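The ZeroDivisionError in the first traceback is what one would expect if a GC fraction is computed on an empty sequence. A minimal sketch, assuming the failing expression divides a G/C count by `len(seq)` (the traceback only shows the `for seq in fa.seqs` line, so this is an assumption); `gc_fraction` is a hypothetical helper, not a GimmeMotifs function:

```python
def gc_fraction(seq):
    """Return the G+C fraction of a sequence, or None for an empty sequence.

    Hypothetical helper for illustration only; not part of GimmeMotifs.
    """
    seq = seq.upper()
    if not seq:
        # An empty sequence would otherwise trigger a division by zero.
        return None
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for example in ["GCAT", "ccgg", ""]:
        print(repr(example), gc_fraction(example))
```

Returning `None` instead of raising makes it possible to report which records are empty before any GC matching is attempted.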

To Reproduce Run gimme motifs

Expected behavior For gimme motifs to run and read its own intermediate files

Error logs Error log:

    2021-01-13 17:30:15,735 - gimme.config - DEBUG - Using multiprocessing
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - Parameters:
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - fraction: 0.2
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - use_strand: False
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - abs_max: 1000
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - analysis: xl
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - enrichment: 1.5
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - size: 0
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - lsize: 500
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - background: ['gc']
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - cluster_threshold: 0.95
    2021-01-13 17:30:15,736 - gimme.config - DEBUG - scan_cutoff: 0.9
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - available_tools: MDmodule,MEME,MEMEW,DREME,Weeder,GADEM,MotifSampler,Trawler,Improbizer,BioProspector,Posmo,ChIPMunk,AMD,HMS,Homer,XXmotif,ProSampler,DiNAMO
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - tools: MEME,Homer,BioProspector
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - pvalue: 0.001
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - max_time: -1
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - ncpus: 12
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - motif_db: gimme.vertebrate.v5.0.pfm
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - use_cache: False
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - custom_background: Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/generated_background.gc.fa
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - genome: mm10
    2021-01-13 17:30:15,737 - gimme.config - DEBUG - No time limit for motif prediction
    2021-01-13 17:30:15,741 - gimme.denovo - INFO - starting full motif analysis
    2021-01-13 17:30:15,741 - gimme.denovo - DEBUG - Using temporary directory /tmp/gimmemotifs.151283.xrc7uklp
    2021-01-13 17:30:15,741 - gimme.denovo - INFO - using original size
    2021-01-13 17:30:15,741 - gimme.denovo - INFO - preparing input from BED
    2021-01-13 17:30:15,746 - gimme.denovo - DEBUG - Splitting Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/input.bed into prediction set (Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/prediction.bed) and validation set (Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/validation.bed)
    2021-01-13 17:30:16,053 - gimme.denovo - DEBUG - Creating GC matched background

Additional context

Head of input bed file looks like this:

    chr10 69165952 69166203
    chr11 98750442 98751534
    chr3 21895407 21895988
    chr2 102815334 102815552
    chr17 93484588 93484861
    chr4 92547518 92547673
    chr2 117637363 117637832
    chr16 52000168 52000605
    chr11 23401137 23401330
    chr18 58143451 58143760

Head of Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/prediction.fa looks like this:

    >chr10:117988166-117988166

    >chr10:39490581-39490581

    >chr10:69114610-69114610

    >chr11:100894221-100894221

    >chr11:103115577-103115577

Head of Alluvial_open_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/localization.fa looks like this:

    >chr9:53356720-53356720
    CCGAAATGAATACGTGATTTTAAGCCACAGGGCCCAGAACACATTGCTGATCGTGATCTCTGCCAGAGATCAGACACAGAACAAATCTGGGAGCACCTTAAAATTTAATTTTGTTGACTTTACCAAGGCATTTGATCGTATCAGGCAGTGTGGGTTCCTACTCCCTCAGCCGCCCTGCAGGTTTGGCCACCCTAGGAAGTTCTCAAGTGTTCGAAGGTTACACTCTGGCAGGAGCTGTGTTTGTTTAGTTATGTCAACACTGCAGACAGTATCAGGAGAGGGCGTTGAAGGTGTGTAAAAGGCCTTGTTTATTCAGTGGCTTTTATGTTCCCCGTTTCCTGAGAGCCAGCATCTAAAAACGCTGGTCTCATACAGATCCTGTTGGGCAGCTGCCTAGCTCATGAGGTCTCAGGTGTTCTCCAGAAAATTGTCCATAACCCCAAAGCACAGCAAGAGTAGGCAGGGAGGACTTGCTAATGCCCAACCCGTCTTCTCAGCAG
    >chr11:114907707-114907707
    ccactcagctatgtccttgattcTAATACTCTTTTTACtattttatttagagactggtgtcttgttaaactgcccaggttgaccttgaactcatcctgtaactcaagcaggccttgaacttgctatctatcctcctgcctcagtttcacctggccctactgcACAACAGGGGTTAAAATTCATATCTGTTCAGATGGAATGGTCAAAGTAGAGACATCCACAAAAGAGGATCACCATGGATATTTACATGACGACAGTAACTGTCACCAAAGAGTCTGCCCAAACCACAGGCTTGCACAAGTCATGATGACCTTGCACAAAGAACCACAGCCTTGCAGCGCGACATCAGCCCAGAAGCTATCTGTTCAGCCTCACACTGGCTCCACCTTTAATACTGACCTTTGTGGTGGAGGACCATTGCTTCAACACAGCTTATGTAACCTTCCCGACCTTCCTGTTAATCCTCACCTTCCTTACTCTCTCACCCTATGACATCTTGC
    >chr15:60166898-60166898
    atcctcgtctacatatgtatTGTTAAGAAAAAGAAGGATTTTTTTTTTTACTTAAATACATGAATACTTTTCTAGAGTACAAAAGAGAAAGTATAAatgaatttgtctataaaattttaaataaacctttagtattcaatcttatgctgatgcaatactgttatttgcttttctttattgaaatgatgacattttcttagattagtcatcttttctttataagataaagtaaaagacttttatttaccttatgagtaatcactcaagaagctaaacatttgggttcttaacagaattatttttatgttcagtattgatttgaccatgtttttgattattactgaaataagatcttcACTTAAAATATCTGGAGAGGagtcccacagtccccagaggagtctccactctcaggcactctagcacgtccaggatcttagggtcactggtgagtagaacacaacatttgttccaataccaccagtagtgactggaaacagcag
    >chr12:56905999-56905999
    TGCTTGTATAGCAGTTTAAGGGTCAAATGCTGCTCTGCCCGGTGCCTCTGACATCACTTAACTCTCCCCGGGAGAGGGGCACTAGCGTCATTCTCTGGGCTGGCCAGCAGCAGGCCTGAAGGTGGGTCCGCAGCAGCTCAGTCTTGTTTCTCCTGGAGACCTACAGGGTCAGAAACTGTGAGGGAGCATTGCAACTAAGACTCTAAGTCTTTGTTTGTATTGGCTTGAGAGAGAATTTTTTTCACAATTTATGAAAACACAATGTAGCCAGAAAGTTACAATGTAAATGTGAGTTTCAAACTTTGCTGTAGTCATAGGTTAATCGTTAGTCATTTGTGGTTGACAAGGGAGGTTATAATTTGACACCTGGCAAAAAATCAAGCCCATTTCAGAGAAGTTAGGAGCTGGGAAATGAGATGGAAAAACACCCTCCATTTCCCTAGGAGGTCCCATGAAAAAGCATTTAGCAAAATCTCTTCTCTCCCTAGAAGTCAATTCCT
    >chr3:45534546-45534546
    ACACTTCATCTCCTGCTGCATCCTTTCCTCGGGGCTACATGTTACTGATGTAAGTGTTTTGTCTTCTTTTACAGGGACACACTATATATCTTAAAATTGTGAAGAAAATGTAGTTAAGTCAATGAGAATTGTGTAATCATCACTACAACCAGTCAACTTTACCACCTTCTCCATGAATCAAGTGATACATTAACCTAATTAGTAAGACTATTTAAGGGACTTCAGGGACAAGAATATCAGAGCTACATTTTCTATAAACAGTATTCATATTCTGAGAGTAAAAACTAGTAATTATTTTCTCATTTCAGTAATGTTATTCAGAAAAAAGATGAACAATAGCAGAGAAATGAATCAGAGAACAACCATCCTTTATCTCCCTCACATTTATGATTCACTGGTAGAATATTACCTTAGGGAATTTTTGTTTGGGAAAACTATAGGCATGGAAACATAATTTTTTTTAtgcccctctcctggtccaccctcccacagtttctcat
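The records in prediction.fa appear to have headers but no sequence, which can be checked programmatically. A minimal sketch, assuming a plain FASTA file whose header lines start with `>`; `empty_fasta_records` is a hypothetical helper written for illustration and is not part of GimmeMotifs:

```python
def empty_fasta_records(lines):
    """Return the headers of FASTA records that contain no sequence.

    `lines` is any iterable of text lines, for example an open file handle.
    Hypothetical helper for illustration only.
    """
    empty = []
    header = None
    length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: report the previous one if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # The last record has no following header, so check it separately.
    if header is not None and length == 0:
        empty.append(header)
    return empty


if __name__ == "__main__":
    example = [
        ">chr10:117988166-117988166",
        "",
        ">chr9:53356720-53356720",
        "CCGAAATGAATACG",
    ]
    print(empty_fasta_records(example))
```

To run it on a real file, pass an open handle, e.g. `with open(path) as handle: empty_fasta_records(handle)`, where `path` points to the intermediate FASTA file.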

simonvh commented 3 years ago

Hi @connorrogerson, I see that you have used the -s 0 flag, which means that the original size of the regions is used. Could it be that some regions in the BED file have size 0? For instance, chr10:117988166-117988166?
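A quick way to test this is to scan the BED file for intervals whose end is not greater than their start. A minimal sketch, assuming a whitespace-delimited BED file with chromosome, start and end in the first three columns; `zero_length_intervals` is a hypothetical helper, not a GimmeMotifs function:

```python
def zero_length_intervals(lines):
    """Yield (line_number, chrom, start, end) for intervals with end <= start.

    `lines` is any iterable of text lines, for example an open BED file.
    Hypothetical helper for illustration only.
    """
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip lines with fewer than three fields, comments and track/browser headers.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            yield number, chrom, start, end


if __name__ == "__main__":
    example = [
        "chr10\t69165952\t69166203",
        "chr10\t117988166\t117988166",
    ]
    for hit in zero_length_intervals(example):
        print(hit)
```

If this reports nothing for the input BED file but does report hits for intermediate/prediction.bed, the zero-length regions were introduced during preprocessing rather than being present in the input.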

connorrogerson commented 3 years ago

Hi @simonvh, my original BED file doesn't have any zero-size intervals. Maybe -s 0 is resizing them to 0?
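To illustrate this hypothesis only: if a size of 0 were applied as a literal width when re-centering regions, every interval would collapse to start == end, which matches names such as chr10:117988166-117988166. The sketch below is an assumption about how such a resize could behave; `resize_around_center` is a hypothetical function and does not reproduce GimmeMotifs' actual code:

```python
def resize_around_center(start, end, size):
    """Return a (start, end) window of width `size` centred on the interval midpoint.

    Hypothetical illustration; not the GimmeMotifs implementation.
    """
    center = (start + end) // 2
    new_start = center - size // 2
    return new_start, new_start + size


if __name__ == "__main__":
    # First region from the head of the input BED file.
    print(resize_around_center(69165952, 69166203, 200))
    # With a literal width of 0 the interval collapses to a single coordinate.
    print(resize_around_center(69165952, 69166203, 0))
```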

If I remove the -s flag, the command line becomes:

    gimme motifs -b /rds/user/cjr78/hpc-work/ATAC/macs2/merged_all/merged_final_fixed_peaks.fa -g mm10 --denovo Alluvial_open_activated_GM.5.0.Forkhead.0008.bed Alluvial_open_activated_GM.5.0.Forkhead.0008_gimmemotifs

This runs fine until I get another error:

    2021-01-14 19:37:52,155 - INFO - starting full motif analysis
    2021-01-14 19:37:52,155 - INFO - using size of 200, set size to 0 to use original region size
    2021-01-14 19:37:52,156 - INFO - preparing input from BED
    2021-01-14 19:37:59,742 - INFO - Copying custom background file /rds/user/cjr78/hpc-work/ATAC/macs2/merged_all/merged_final_fixed_peaks.fa to Alluvial_open_activated_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/prediction.bg.fa.
    2021-01-14 19:38:01,789 - WARNING - The custom background file /rds/user/cjr78/hpc-work/ATAC/macs2/merged_all/merged_final_fixed_peaks.fa contains sequences with a median size of 277.0, while GimmeMotifs predicts motifs in sequences of size 200. This will influence the statistics! It is recommended to use background sequences of the same size.
    2021-01-14 19:38:02,103 - INFO - Copying custom background file /rds/user/cjr78/hpc-work/ATAC/macs2/merged_all/merged_final_fixed_peaks.fa to Alluvial_open_activated_GM.5.0.Forkhead.0008_gimmemotifs/intermediate/bg.custom.fa.
    2021-01-14 19:38:03,962 - WARNING - The custom background file /rds/user/cjr78/hpc-work/ATAC/macs2/merged_all/merged_final_fixed_peaks.fa contains sequences with a median size of 277.0, while GimmeMotifs predicts motifs in sequences of size 200. This will influence the statistics! It is recommended to use background sequences of the same size.
    2021-01-14 19:38:04,244 - INFO - starting motif prediction (xl)
    2021-01-14 19:38:04,244 - INFO - tools: MEME, BioProspector, Homer
    2021-01-14 19:38:06,178 - INFO - all jobs submitted
    2021-01-14 19:38:10,764 - INFO - BioProspector_width_6 finished, found 5 motifs
    2021-01-14 19:38:11,122 - INFO - BioProspector_width_8 finished, found 5 motifs
    2021-01-14 19:38:11,429 - INFO - BioProspector_width_10 finished, found 5 motifs
    2021-01-14 19:38:11,833 - INFO - BioProspector_width_12 finished, found 5 motifs
    2021-01-14 19:38:12,003 - INFO - BioProspector_width_14 finished, found 5 motifs
    2021-01-14 19:38:12,246 - INFO - BioProspector_width_16 finished, found 5 motifs
    2021-01-14 19:38:12,517 - INFO - BioProspector_width_18 finished, found 5 motifs
    2021-01-14 19:38:12,755 - INFO - BioProspector_width_20 finished, found 5 motifs
    2021-01-14 19:38:52,766 - INFO - MEME_width_12 finished, found 10 motifs
    2021-01-14 19:38:53,987 - INFO - MEME_width_10 finished, found 10 motifs
    2021-01-14 19:38:55,282 - INFO - MEME_width_8 finished, found 10 motifs
    2021-01-14 19:38:57,992 - INFO - MEME_width_6 finished, found 10 motifs
    2021-01-14 19:39:14,379 - INFO - Homer_width_6 finished, found 5 motifs
    2021-01-14 19:39:31,900 - INFO - MEME_width_14 finished, found 10 motifs
    2021-01-14 19:39:33,734 - INFO - Homer_width_8 finished, found 5 motifs
    2021-01-14 19:39:34,656 - INFO - MEME_width_18 finished, found 10 motifs
    2021-01-14 19:39:37,034 - INFO - MEME_width_20 finished, found 10 motifs
    2021-01-14 19:39:39,040 - INFO - MEME_width_16 finished, found 10 motifs
    2021-01-14 19:40:11,144 - INFO - Homer_width_10 finished, found 5 motifs
    2021-01-14 19:41:26,646 - INFO - Homer_width_12 finished, found 5 motifs
    2021-01-14 19:53:14,696 - INFO - Homer_width_14 finished, found 5 motifs
    2021-01-14 20:05:10,847 - INFO - Homer_width_16 finished, found 5 motifs
    2021-01-14 20:25:24,517 - INFO - Homer_width_18 finished, found 5 motifs
    2021-01-14 20:45:02,464 - INFO - Homer_width_20 finished, found 5 motifs
    2021-01-14 20:46:47,511 - INFO - predicted 160 motifs
    2021-01-14 20:46:47,592 - INFO - 43 motifs are significant
    2021-01-14 20:46:47,905 - INFO - clustering 43 motifs.
    2021-01-14 20:47:40,988 - INFO - creating de novo reports
    2021-01-14 20:48:18,105 - INFO - finished
    2021-01-14 20:48:18,106 - INFO - output dir: Alluvial_open_activated_GM.5.0.Forkhead.0008_gimmemotifs
    2021-01-14 20:48:18,106 - INFO - de novo report: Alluvial_open_activated_GM.5.0.Forkhead.0008_gimmemotifs/gimme.denovo.html
    2021-01-14 20:49:06,833 - INFO - creating motif scan tables
    2021-01-14 20:49:28,531 - INFO - calculating stats
    2021-01-14 20:49:29,806 - INFO - selecting non-redundant motifs
    Traceback (most recent call last):
      File "/home/cjr78/miniconda3/envs/gimme/bin/gimme", line 11, in <module>
        cli(sys.argv[1:])
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/cli.py", line 625, in cli
        args.func(args)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/commands/motifs.py", line 213, in motifs
        tolerance=0.001,
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/gimmemotifs/comparison.py", line 985, in select_nonredundant_motifs
        fit = rfe.fit(X_bla, y_train)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/sklearn/feature_selection/_rfe.py", line 149, in fit
        return self._fit(X, y)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/sklearn/feature_selection/_rfe.py", line 159, in _fit
        force_all_finite=not tags.get('allow_nan', True))
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/sklearn/utils/validation.py", line 755, in check_X_y
        estimator=estimator)
      File "/home/cjr78/miniconda3/envs/gimme/lib/python3.7/site-packages/sklearn/utils/validation.py", line 475, in check_array
        dtype_orig = np.result_type(*dtypes_orig)
      File "<__array_function__ internals>", line 6, in result_type
    ValueError: at least one array or dtype is required

connorrogerson commented 3 years ago

Updating gimmemotifs with pip seems to have fixed both issues. FYI, installing or updating with conda takes a long time; it appears to get stuck while solving the environment.

simonvh commented 3 years ago

Closing, as this issue is fixed (the new release is now on conda as well). If you have problems with solving the environment, I can really recommend mamba as a drop-in replacement for conda. It is really fast, and we have encountered no issues so far.