snystrom / memes

An R interface to the MEME Suite
https://snystrom.github.io/memes/
Other
43 stars 5 forks source link

Motif names do not always match motif object name in `runStreme` output #104

Closed mniederhuber closed 7 months ago

mniederhuber commented 1 year ago

Heads up, when using runStreme the names of motif objects within the output dataframe do not always match the actual identified motif.

Screen Shot 2023-02-01 at 11 02 24 AM

This doesn't seem to be the case for runDreme output.

Environment: R 4.2.2 memes 1.6.0

snystrom commented 1 year ago

oh that's gross. I'm pretty sure I know why this happens. I thought I fixed it but guess not. Do you mind sharing the streme output file that it's parsing? You might have to rerun & set the outdir arg to a directory you can see, then just zip it up & attach here (or email if it's data you don't wanna share yet).

mniederhuber commented 1 year ago

sure thing, let me know if this attachment doesn't work. streme-output.zip

snystrom commented 1 year ago

That looks like it should work! I'll see if I can push a fix tonight after work.

This is assuredly because I never fixed this line: https://github.com/snystrom/memes/blob/457c24c2c9a1590d125351e91c62eb1b9e8bd25c/R/streme.R#L393

Honestly hilariously gnarly, I can only imagine I was in a fuge state when I wrote it.

snystrom commented 1 year ago

As a note to myself, I think what is likely happening here is the obviously wrong regex is messing up the sort order of the names (so m1m10 sorts after m1 but before m2 causing an off-by ~ N mod M error where N is the number of motifs >9, and M is the number of motifs divided by 10, jeez what a mess).

To confirm, I'm assuming in your current list, m1 is labeled correctly?

mniederhuber commented 1 year ago

That's right, m1 is labeled correctly, but subsequent names are mismatched.

snystrom commented 1 year ago

This is fixed in the bioc devel branch (master on github). I suggest you install using remotes::install_github("snystrom/memes") if you don't want to use all of bioc devel in your analysis code (which you probably shouldn't).

As part of this fix, I slightly modified how motif names are parsed to make them more consistent with other tools. So, something that would previously be m1_ATGC will now be m01_ATGC. This should make the Streme tools more consistent with how Dreme and TomTom behave. The sort order will now always be preserved in the order Streme returned them. This might break some code if you hard-coded any filters using these ids. Hope it's not too painful of a fix.

As a side note, I'm glad our old friend: CACACACACAYAC is still showing up as a top contender in things we care about.

mniederhuber commented 1 year ago

Great! I'm just going to reassign names of the streme output for now rather than install the dev branch. But great to know theres a patch! Won't break anything for me so all good.

snystrom commented 1 year ago

Yeah that seems like the correct approach. Based on my poking around, the list names were just incorrect, but the motif data was ok. You can prove this to yourself by comparing the streme output html to the motif list output (which I forgot to do, so definitely do that to make sure)

snystrom commented 1 year ago

And for what it's worth, if it is a problem, the only difference between the GitHub master and version 1.6.0 is this fix and another small patch to fix some typos so it should be safe to install.

seb-mueller commented 7 months ago

Just bringing this up again, since I still have this issue in version memes_1.8.0. Below is the example from help, apart from the fact that dream/meme/streme give very different results using defaults, the last streme figure shows that m2 and m4 don't match the logo.

library(tidyverse)
library(memes)
library(Biostrings)
library(IRanges)
library(universalmotif)
options(meme_bin="~/software/conda/envs/orbit_conda/bin/")
# Genome object
dm.genome <- BSgenome.Dmelanogaster.UCSC.dm6::BSgenome.Dmelanogaster.UCSC.dm6
# Generate sequences from 200bp about the center of my peaks of interest
seqs <- example_peaks %>% 
  resize(200, "center") %>% 
  get_sequence(dm.genome)

(meme_results <- runMeme(seqs, parse_genomic_coord = FALSE))
#>          motif                  name altname             consensus alphabet
#> 1 <mot:GSSW..> GSSWSMGSSTSCKSKCTGSSG  MEME-1 GSSWSMGSSTSCKSKCTGSSG      DNA
#>   strand  icscore nsites eval type pseudocount          bkg width   sites_hits
#> 1      + 14.66577      3  7.3  PPM           1 0.261, 0....    21 c("chr3L....
#> 
#> [Hidden empty columns: family, organism, bkgsites, pval, qval.]
meme_results %>% 
  to_list() %>% 
  view_motifs()


(dreme_results <- runDreme(seqs, control = "shuffle", e = 50))
#>                  motif      name altname consensus alphabet strand icscore
#> m01_AGAGC <mot:m01_..> m01_AGAGC DREME-1     AGAGC      DNA     +-      10
#>           nsites bkgsites   pval eval type pseudocount          bkg rank   seq
#> m01_AGAGC      9       10 0.0015  7.8  PCM           0 0.261, 0....    1 AGAGC
#>           length positive_hits negative_hits unerased_evalue positive_total
#> m01_AGAGC      5             7             0             7.8             10
#>           negative_total pos_frac neg_frac
#> m01_AGAGC             10      0.7        0
#> 
#> [Hidden empty columns: family, organism, qval.]
dreme_results %>% 
  to_list() %>% 
  view_motifs()


(streme_results <- runStreme(seqs, control = "shuffle", e = 50))
#> Warning: No hold-out set was created because the primary hold-out set
#>          would have had fewer than 5 sequences.
#> Warning: Ignoring <thresh> (0.05) and setting <nmotifs> to 5.
#>                         motif             name  altname     consensus alphabet
#> m1_AGAGCAGHVA    <mot:m1_A..>    m1_AGAGCAGHVA STREME-1    ARWGCAGNNA      DNA
#> m2_CTCTTTCCCHTCT <mot:m2_C..> m2_CTCTTTCCCHTCT STREME-2 CTCTTTCCCHWMT      DNA
#> m3_ATGGAATGAA    <mot:m3_A..>    m3_ATGGAATGAA STREME-3    WTGGAATGAA      DNA
#> m4_CTTTMATRT     <mot:m4_C..>     m4_CTTTMATRT STREME-4     CTTTMATRT      DNA
#> m5_RMSARCCAACGV  <mot:m5_R..>  m5_RMSARCCAACGV STREME-5  RMSARCCAACGS      DNA
#>                  strand  icscore nsites pval type pseudocount          bkg
#> m1_AGAGCAGHVA        +- 15.37744      8    1  PCM           0 0.285, 0....
#> m2_CTCTTTCCCHTCT     +- 16.02660      8    1  PCM           0 0.285, 0....
#> m3_ATGGAATGAA        +- 11.40899      7    1  PCM           0 0.285, 0....
#> m4_CTTTMATRT         +- 13.46741      7    1  PCM           0 0.285, 0....
#> m5_RMSARCCAACGV      +- 16.24358      7    1  PCM           0 0.285, 0....
#>                  width initial_width          seed score_threshold pos_count
#> m1_AGAGCAGHVA       10             6    AGAGCAGTCA         4.92334         0
#> m2_CTCTTTCCCHTCT    13            14 CTCTTTCCCATCT         8.84603         0
#> m3_ATGGAATGAA       10             8    ATGGAATGAA         10.8311         0
#> m4_CTTTMATRT         9             6     CTTTCATAT         10.7169         0
#> m5_RMSARCCAACGV     12            10  ACCAACCAACGG         2.70898         0
#>                  neg_count log_pval log_evalue   evalue dtc bernoulli
#> m1_AGAGCAGHVA            0        0    0.69897 5.0e+000  -1        -1
#> m2_CTCTTTCCCHTCT         0        0    0.69897 5.0e+000  -1        -1
#> m3_ATGGAATGAA            0        0    0.69897 5.0e+000  -1        -1
#> m4_CTTTMATRT             0        0    0.69897 5.0e+000  -1        -1
#> m5_RMSARCCAACGV          0        0    0.69897 5.0e+000  -1        -1
#>                  train_pos_count train_neg_count train_log_pval train_pval
#> m1_AGAGCAGHVA                  8               0       -3.44705    0.00036
#> m2_CTCTTTCCCHTCT               8               0       -3.44705    0.00036
#> m3_ATGGAATGAA                  7               0       -2.81023     0.0015
#> m4_CTTTMATRT                   7               0       -2.81023     0.0015
#> m5_RMSARCCAACGV                7               0       -2.81023     0.0015
#>                  train_dtc train_bernoulli npassing is_palindromic elapsed_time
#> m1_AGAGCAGHVA           -1              -1        8             no          0.3
#> m2_CTCTTTCCCHTCT        -1              -1        8             no          0.1
#> m3_ATGGAATGAA           -1              -1        7             no          0.4
#> m4_CTTTMATRT            -1              -1        7             no          0.2
#> m5_RMSARCCAACGV         -1              -1        7             no          0.5
#>                  total_sites
#> m1_AGAGCAGHVA              8
#> m2_CTCTTTCCCHTCT           8
#> m3_ATGGAATGAA              7
#> m4_CTTTMATRT               7
#> m5_RMSARCCAACGV            7
#>                                                                                                                                                                                                                                                                                                                                                                                                        site_distr
#> m1_AGAGCAGHVA       0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
#> m2_CTCTTTCCCHTCT          0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> m3_ATGGAATGAA       0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> m4_CTTTMATRT      0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> m5_RMSARCCAACGV         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>                  max_sites site_hist
#> m1_AGAGCAGHVA            2     0 5 3
#> m2_CTCTTTCCCHTCT         1       0 8
#> m3_ATGGAATGAA            1       0 7
#> m4_CTTTMATRT             1       0 7
#> m5_RMSARCCAACGV          2     0 5 2
#> 
#> [Hidden empty columns: family, organism, bkgsites, qval, eval.]
streme_results %>% 
  to_list() %>% 
  view_motifs()

Created on 2023-11-30 with reprex v2.0.2

seb-mueller commented 7 months ago

Also, for some reason I can't reopen this issues, should I create a new one?

snystrom commented 7 months ago

Hey @seb-mueller, I think this is actually an artifact of the rendering process for these motifs. For the ones where the logo doesn't match the IUPAC name, you'll see that they're shown as [RC] (the m2_CTCTTTCCCHTCT logo in the final example, for instance), and if you reverse-complement the displayed logo, you'll see it does indeed match the IUPAC.

Or did I misunderstand your issue?

seb-mueller commented 7 months ago

Good spot, I didn't realize it could report it that way, at least that explains it for DNA Sequences. However I only constructed those for easy reproducibility, I spoted the actual issue with AA Sequences, where a reverse complement makes less sense. I only have a screenshot at hand, but happy to provide a reproducible example if needed. You can see that for instance m5 or m8 etc don't match up. Somehow it tries to do an RC which is not really working on AA sequences.

(streme_results <- runStreme(aaseqs, control = "shuffle", e = 50, alph = "protein", nmotifs = 10))
#                      motif          name   altname consensus alphabet strand  icscore nsites pval type pseudocount          bkg width initial_width      seed score_threshold
# m1_PWRKNACT   <mot:m1_P..>   m1_PWRKNACT  STREME-1  PWRKNACT       AA     +- 22.97021     18 0.24  PCM           0 0.0963, ....     8             6  PWRKNACT        0.480661
# m1m0_ICFWCPAN <mot:m2_N..>  m2_NANANAVQD  STREME-2 NANANAVQD       AA     +- 29.34581      8 0.50  PCM           0 0.0963, ....     9             6 NAAANAMQL        0.747902
# m2_NANANAVQD  <mot:m3_F..>   m3_FWIKLFNA  STREME-3  FWIKLFNA       AA     +- 22.65608     20 1.00  PCM           0 0.0963, ....     8             4  FWIKLFNA         4.46709
# m3_FWIKLFNA   <mot:m4_T..>   m4_TNAYNAIC  STREME-4  TNAYNAIC       AA     +- 20.52898     19 1.00  PCM           0 0.0963, ....     8             4  TNAYNAIC         6.10298
# m4_TNAYNAIC   <mot:m5_D..>   m5_DNASNWTI  STREME-5  DNASNWTI       AA     +- 18.40615     12 1.00  PCM           0 0.0963, ....     8             5  DNASNWTI         14.1228
# m5_DNASNWTI   <mot:m6_N..>   m6_NALVHRBA  STREME-6  NALVHRNA       AA     +- 22.68425     10 1.00  PCM           0 0.0963, ....     8             4  NALVHRDI         3.17051
# m6_NALVHRBA   <mot:m7_N..>   m7_NAPLHQRR  STREME-7  NAPLHQRR       AA     +- 21.81291     10 1.00  PCM           0 0.0963, ....     8             5  NAPLHQRR         14.0807
# m7_NAPLHQRR   <mot:m8_W..>   m8_WNARHNFT  STREME-8  WNARHNFT       AA     +- 22.90196      7 1.00  PCM           0 0.0963, ....     8             5  WNARHNST         12.9031
# m8_WNARHNFT   <mot:m9_H..>   m9_HQCSTFVH  STREME-9  HQCSTFVH       AA     +- 26.52898      5 1.00  PCM           0 0.0963, ....     8             4  HQCSTLVG        0.909532
# m9_HQCSTFVH   <mot:m1m0..> m1m0_ICFWCPAN STREME-10  ICFWCPAN       AA     +- 27.72584      5 1.00  PCM           0 0.0963, ....     8             4  ICFWSPFK         6.28344
#               pos_count neg_count  log_pval log_evalue   evalue dtc bernoulli train_pos_count train_neg_count train_log_pval train_pval train_dtc train_bernoulli npassing
# m1_PWRKNACT           2         0 -0.625541   0.374459 2.4e+000  -1       0.5              16               0       -4.81563    1.5e-05        -1        0.500066       18
# m1m0_ICFWCPAN         1         0  -0.30103    0.69897 5.0e+000  -1       0.5               7               0       -2.10681     0.0078        -1        0.500066        8
# m2_NANANAVQD          0         0         0          1 1.0e+001  -1       0.5              20               0       -6.01954    9.6e-07        -1        0.500066       20
# m3_FWIKLFNA           0         0         0          1 1.0e+001  -1       0.5              19               0       -5.71857    1.9e-06        -1        0.500066       19
# m4_TNAYNAIC           0         0         0          1 1.0e+001  -1       0.5              12               0       -3.61167    0.00024        -1        0.500066       12
# m5_DNASNWTI           0         0         0          1 1.0e+001  -1       0.5              10               0       -3.00973    0.00098        -1        0.500066       10
# m6_NALVHRBA           0         0         0          1 1.0e+001  -1       0.5              10               0       -3.00973    0.00098        -1        0.500066       10
# m7_NAPLHQRR           0         0         0          1 1.0e+001  -1       0.5               7               0       -2.10681     0.0078        -1        0.500066        7
# m8_WNARHNFT           0         1         0          1 1.0e+001  -1       0.5               5               0       -1.50486      0.031        -1        0.500066        6
# m9_HQCSTFVH           0         0         0          1 1.0e+001  -1       0.5               5               0       -1.50486      0.031        -1        0.500066        5
#               is_palindromic elapsed_time total_sites                                                                                           site_distr max_sites
# m1_PWRKNACT              n/a          0.2          16  0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0         2
# m1m0_ICFWCPAN            n/a          0.3           7    0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0         2
# m2_NANANAVQD             n/a          0.4          20  0 0 1 0 0 0 0 0 1 1 2 0 1 0 1 1 1 1 0 0 0 0 0 2 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0         2
# m3_FWIKLFNA              n/a          0.1          19  0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 2 0 1 0 0 0 1 0 1 1 2 0 0 0 0 0         1
# m4_TNAYNAIC              n/a          0.1          12  0 0 1 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0         1
# m5_DNASNWTI              n/a          0.5          10  0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0         2
# m6_NALVHRBA              n/a          0.2          10  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0         1
# m7_NAPLHQRR              n/a          0.3           7  0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0         1
# m8_WNARHNFT              n/a          0.6           5  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0         1
# m9_HQCSTFVH              n/a          0.5           5  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0         1
#               site_hist
# m1_PWRKNACT      0 14 2
# m1m0_ICFWCPAN     0 6 1
# m2_NANANAVQD     0 19 1
# m3_FWIKLFNA        0 19
# m4_TNAYNAIC        0 12
# m5_DNASNWTI       0 8 2
# m6_NALVHRBA        0 10
# m7_NAPLHQRR         0 7
# m8_WNARHNFT         0 5
# m9_HQCSTFVH         0 5

streme_results %>% 
  to_list() %>% 
  view_motifs()

image

snystrom commented 7 months ago

Oh that's bizzare. Can you upload the .xml file output? You may need to rerun your code but set outdir to a location you can access on your drive. This should have been fixed in https://github.com/snystrom/memes/commit/1a3eeefe50cf9ebe88139294a9a26fa88945a410.

seb-mueller commented 7 months ago

Sure, I had to suffix a txt to the xml to be able to upload: streme.xml.txt Also, here is a screenshot from the html file: image

snystrom commented 7 months ago

Thanks! I'll take a look this evening.

snystrom commented 7 months ago

Woah. @seb-mueller I found the issue. Bioconductor's upstream copy of memes is not in sync with this repo. If you compare: https://code.bioconductor.org/browse/memes/blob/RELEASE_3_18/R/streme.R on bioc to the one from this repo: https://github.com/snystrom/memes/blob/master/R/streme.R you'll see clean_streme_ids() is missing.... I swear I upstreamed these changes, but maybe I forgot??

Also observe the commit history at the bioc mirror: https://code.bioconductor.org/browse/memes/commits/RELEASE_3_18

Regardless, I'll try to get the repo polished up & merged w/ the bioc mirror. Sorry for the hiccup. In the mean time, I guess you can install the github version using remotes::install_github. Ugh, what a mess.

seb-mueller commented 7 months ago

Thanks so much @snystrom ! Nice and quickly solved, cheers!