vanheeringen-lab / gimmemotifs

Suite of motif tools, including a motif prediction pipeline for ChIP-seq experiments. See full GimmeMotifs documentation for detailed installation instructions and usage examples.
https://gimmemotifs.readthedocs.io/en/master
MIT License
109 stars 33 forks source link

Add motif_db argument to gimme motifs #61

Open Rmulet opened 6 years ago

Rmulet commented 6 years ago

Hi,

So far I have been using the default database that comes bundled with Gimmemotifs, which is quite alright for most purposes. However, Jaspar 2018 has just been released and I wanted to repeat some analyses to see what new motifs may pop up. For some tools this is relatively straightforward (--pwmfile in match or maelstrom), but I did not see such an option in gimme motifs. Thus, I had to modify the config file, which is not as convenient as adding an argument.

I know it's a minor thing, but I leave this suggestion here in case you want to implement it if you deem it worthwhile.

PS: Now that I think of it, the number of CPUs to be used could also be a valuable addition to command line program calls.

simonvh commented 6 years ago

It's a good point. Similarly, I want to add some other arguments to the command-line too, such as the number of threads to use. I'll keep it on the to-do. (which I see you also mentioned...)

crazyhottommy commented 6 years ago

Hi,

The JASPAR motif is in a different format than gimmme accepts. How can I convert it? http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt

Thanks! Tommy

Rmulet commented 6 years ago

@crazyhottommy You can use functions in the gimmemotifs module to that end. Some useful functions are described in this link: http://gimmemotifs.readthedocs.io/en/master/api.html. I used the following code snippet a few months ago based on the examples there (with the right indentation, obviously):

# Read motifs in Jaspar format with open(jaspar_file) as f: motifs = readmotifs(f, fmt="jaspar") # Print them in .pwm format used by gimmemotifs for motif in motifs: motif.id = motif.id.replace('\t','') # I had to replace tabs with underscore to avoid logo issues outfile.write(motif.to_pwm())

crazyhottommy commented 6 years ago

@Rmulet thanks for sharing the code!

crazyhottommy commented 6 years ago

need to add the new line

jaspar_file = "JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt"
outfile = open("JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.pwm", "w")
with open(jaspar_file) as f:
        motifs = read_motifs(f, fmt="jaspar")
        # Print them in .pwm format used by gimmemotifs
        for motif in motifs:
                #  I had to replace tabs with underscore to avoid logo issues
                motif.id = motif.id.replace('\t','_') 
                outfile.write(motif.to_pwm() + '\n')
crazyhottommy commented 6 years ago

Hi,

I got a key error:

gimme maelstrom myinput.txt hg19 maelstrom.JASPAR18.out -p /apps/conda/envs/gimme/share/gimmemotifs/motif_databases/JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.pwm
2018-05-07 23:22:48,410 - INFO - Starting maelstrom
2018-05-07 23:22:48,423 - INFO - Motif scanning (counts)
2018-05-07 23:22:48,423 - INFO - reading table
2018-05-07 23:22:48,512 - INFO - setting threshold
Determining threshold for fpr 0.01 and length 200 based on hg19
2018-05-07 23:39:16,407 - INFO - creating count table
Traceback (most recent call last):
  File "/apps/conda/envs/gimme/bin/gimme", line 496, in <module>
    args.func(args)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/commands/maelstrom.py", line 29, in maelstrom
    run_maelstrom(infile, genome, outdir, pwmfile, methods=methods)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/maelstrom.py", line 193, in run_maelstrom
    pwmfile=pwmfile)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/maelstrom.py", line 61, in scan_to_table
    for row in s.count(regions):
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/scanner.py", line 445, in count
    for matches in self.scan(seqs, nreport, scan_rc):
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/scanner.py", line 496, in scan
    for result in it:
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/scanner.py", line 578, in _scan_sequences
    motifs = [(m, self.threshold[m.id]) for m in read_motifs(f)]
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/scanner.py", line 578, in <listcomp>
    motifs = [(m, self.threshold[m.id]) for m in read_motifs(f)]
KeyError: 'MA1140.1_JUNB(var.2)'

should I not change the tab to _?

crazyhottommy commented 6 years ago

UPDATE. if I did not change the tab to _, I got a different error:

gimme maelstrom  my_input.txt hg19 maelstrom_JASPAR18.out -p /apps/conda/envs/gimme/share/gimmemotifs/motif_databases/JASPAR2018_CORE_vertebrates_non-redundant_pfms_jaspar.pwm
2018-05-08 08:24:34,970 - INFO - Starting maelstrom
2018-05-08 08:24:34,983 - INFO - Motif scanning (counts)
2018-05-08 08:24:34,983 - INFO - reading table
2018-05-08 08:24:35,073 - INFO - setting threshold
2018-05-08 08:24:35,748 - INFO - creating count table
2018-05-08 08:44:44,214 - INFO - done
2018-05-08 08:44:44,295 - INFO - creating dataframe
2018-05-08 08:44:52,309 - INFO - Motif scanning (scores)
2018-05-08 08:44:52,309 - INFO - reading table
2018-05-08 08:44:52,552 - INFO - creating score table
2018-05-08 09:05:55,895 - INFO - done
2018-05-08 09:05:55,987 - INFO - creating dataframe
2018-05-08 09:06:32,577 - INFO - Fitting LightningClassification
2018-05-08 09:29:11,227 - INFO - Done
2018-05-08 09:29:13,426 - INFO - Fitting MWU
/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/moap.py:497: RuntimeWarning: divide by zero encountered in log10
  self.act_ = pd.DataFrame(-np.log10(fpr.T),
2018-05-08 09:29:19,802 - INFO - Done
2018-05-08 09:29:20,718 - INFO - Fitting Hypergeom
2018-05-08 09:29:25,571 - INFO - Done
2018-05-08 09:29:27,737 - INFO - Fitting RF
2018-05-08 09:30:46,125 - INFO - Done
2018-05-08 09:30:46,136 - INFO - Rank aggregation
2018-05-08 09:30:49,836 - INFO - html report
Traceback (most recent call last):
  File "/apps/conda/envs/gimme/bin/gimme", line 496, in <module>
    args.func(args)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/commands/maelstrom.py", line 29, in maelstrom
    run_maelstrom(infile, genome, outdir, pwmfile, methods=methods)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/maelstrom.py", line 309, in run_maelstrom
    pwmfile
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/gimmemotifs/report.py", line 252, in maelstrom_html_report
    f = df["factors"].str.len() > 30
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/pandas/core/generic.py", line 3077, in __getattr__
    return object.__getattribute__(self, name)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/pandas/core/base.py", line 243, in __get__
    return self.construct_accessor(instance)
  File "/apps/conda/envs/gimme/lib/python3.6/site-packages/pandas/core/strings.py", line 1909, in _make_str_accessor
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

The activity* files look fine, but the final.out.csv file only contains the TF name but not the scores for each cluster. and no html file was generated.

simonvh commented 6 years ago

Thanks @Rmulet for helping out. Sorry for not replying earlier @crazyhottommy, I was on vacation.

I think there is a bug relating to either motif names (you can try removing all special characters) or the link of motifs to TFs. I will have a look it at this once I find some time.

crazyhottommy commented 6 years ago

No problem! in the motif_database folder, there are files gimme.vertebrate.v3.1.motif2factors.txt I need to create one for JASPAR2018 as well?

simonvh commented 6 years ago

Yes, that should work. In the same directory as your motif file. It's a simple tab separated file with a header:

motif    factors
Motif1    Factor1,Factor2
Motif2    Factor3,Factor4,Factor5
crazyhottommy commented 6 years ago

when one uses different motifs .pwm files, how gimme maelstrom finds the correct linking file? I need to give the same prefix for the linking file and the pwm file?