Closed mniederhuber closed 7 months ago
oh that's gross. I'm pretty sure I know why this happens. I thought I fixed it but guess not. Do you mind sharing the streme output file that it's parsing? You might have to rerun & set the outdir
arg to a directory you can see, then just zip it up & attach here (or email if it's data you don't wanna share yet).
sure thing, let me know if this attachment doesn't work. streme-output.zip
That looks like it should work! I'll see if I can push a fix tonight after work.
This is assuredly because I never fixed this line: https://github.com/snystrom/memes/blob/457c24c2c9a1590d125351e91c62eb1b9e8bd25c/R/streme.R#L393
Honestly hilariously gnarly, I can only imagine I was in a fuge state when I wrote it.
As a note to myself, I think what is likely happening here is the obviously wrong regex is messing up the sort order of the names (so m1m10 sorts after m1 but before m2 causing an off-by ~ N mod M error where N is the number of motifs >9, and M is the number of motifs divided by 10, jeez what a mess).
To confirm, I'm assuming in your current list, m1 is labeled correctly?
That's right, m1 is labeled correctly, but subsequent names are mismatched.
This is fixed in the bioc devel branch (master
on github). I suggest you install using remotes::install_github("snystrom/memes")
if you don't want to use all of bioc devel in your analysis code (which you probably shouldn't).
As part of this fix, I slightly modified how motif names are parsed to make them more consistent with other tools. So, something that would previously be m1_ATGC
will now be m01_ATGC
. This should make the Streme
tools more consistent with how Dreme and TomTom behave. The sort order will now always be preserved in the order Streme returned them. This might break some code if you hard-coded any filters using these ids. Hope it's not too painful of a fix.
As a side note, I'm glad our old friend: CACACACACAYAC
is still showing up as a top contender in things we care about.
Great! I'm just going to reassign names of the streme output for now rather than install the dev branch. But great to know theres a patch! Won't break anything for me so all good.
Yeah that seems like the correct approach. Based on my poking around, the list names were just incorrect, but the motif data was ok. You can prove this to yourself by comparing the streme output html to the motif list output (which I forgot to do, so definitely do that to make sure)
And for what it's worth, if it is a problem, the only difference between the GitHub master and version 1.6.0 is this fix and another small patch to fix some typos so it should be safe to install.
Just bringing this up again, since I still have this issue in version memes_1.8.0
.
Below is the example from help, apart from the fact that dream/meme/streme give very different
results using defaults, the last streme figure shows that m2 and m4 don't match the logo.
library(tidyverse)
library(memes)
library(Biostrings)
library(IRanges)
library(universalmotif)
options(meme_bin="~/software/conda/envs/orbit_conda/bin/")
# Genome object
dm.genome <- BSgenome.Dmelanogaster.UCSC.dm6::BSgenome.Dmelanogaster.UCSC.dm6
# Generate sequences from 200bp about the center of my peaks of interest
seqs <- example_peaks %>%
resize(200, "center") %>%
get_sequence(dm.genome)
(meme_results <- runMeme(seqs, parse_genomic_coord = FALSE))
#> motif name altname consensus alphabet
#> 1 <mot:GSSW..> GSSWSMGSSTSCKSKCTGSSG MEME-1 GSSWSMGSSTSCKSKCTGSSG DNA
#> strand icscore nsites eval type pseudocount bkg width sites_hits
#> 1 + 14.66577 3 7.3 PPM 1 0.261, 0.... 21 c("chr3L....
#>
#> [Hidden empty columns: family, organism, bkgsites, pval, qval.]
meme_results %>%
to_list() %>%
view_motifs()
(dreme_results <- runDreme(seqs, control = "shuffle", e = 50))
#> motif name altname consensus alphabet strand icscore
#> m01_AGAGC <mot:m01_..> m01_AGAGC DREME-1 AGAGC DNA +- 10
#> nsites bkgsites pval eval type pseudocount bkg rank seq
#> m01_AGAGC 9 10 0.0015 7.8 PCM 0 0.261, 0.... 1 AGAGC
#> length positive_hits negative_hits unerased_evalue positive_total
#> m01_AGAGC 5 7 0 7.8 10
#> negative_total pos_frac neg_frac
#> m01_AGAGC 10 0.7 0
#>
#> [Hidden empty columns: family, organism, qval.]
dreme_results %>%
to_list() %>%
view_motifs()
(streme_results <- runStreme(seqs, control = "shuffle", e = 50))
#> Warning: No hold-out set was created because the primary hold-out set
#> would have had fewer than 5 sequences.
#> Warning: Ignoring <thresh> (0.05) and setting <nmotifs> to 5.
#> motif name altname consensus alphabet
#> m1_AGAGCAGHVA <mot:m1_A..> m1_AGAGCAGHVA STREME-1 ARWGCAGNNA DNA
#> m2_CTCTTTCCCHTCT <mot:m2_C..> m2_CTCTTTCCCHTCT STREME-2 CTCTTTCCCHWMT DNA
#> m3_ATGGAATGAA <mot:m3_A..> m3_ATGGAATGAA STREME-3 WTGGAATGAA DNA
#> m4_CTTTMATRT <mot:m4_C..> m4_CTTTMATRT STREME-4 CTTTMATRT DNA
#> m5_RMSARCCAACGV <mot:m5_R..> m5_RMSARCCAACGV STREME-5 RMSARCCAACGS DNA
#> strand icscore nsites pval type pseudocount bkg
#> m1_AGAGCAGHVA +- 15.37744 8 1 PCM 0 0.285, 0....
#> m2_CTCTTTCCCHTCT +- 16.02660 8 1 PCM 0 0.285, 0....
#> m3_ATGGAATGAA +- 11.40899 7 1 PCM 0 0.285, 0....
#> m4_CTTTMATRT +- 13.46741 7 1 PCM 0 0.285, 0....
#> m5_RMSARCCAACGV +- 16.24358 7 1 PCM 0 0.285, 0....
#> width initial_width seed score_threshold pos_count
#> m1_AGAGCAGHVA 10 6 AGAGCAGTCA 4.92334 0
#> m2_CTCTTTCCCHTCT 13 14 CTCTTTCCCATCT 8.84603 0
#> m3_ATGGAATGAA 10 8 ATGGAATGAA 10.8311 0
#> m4_CTTTMATRT 9 6 CTTTCATAT 10.7169 0
#> m5_RMSARCCAACGV 12 10 ACCAACCAACGG 2.70898 0
#> neg_count log_pval log_evalue evalue dtc bernoulli
#> m1_AGAGCAGHVA 0 0 0.69897 5.0e+000 -1 -1
#> m2_CTCTTTCCCHTCT 0 0 0.69897 5.0e+000 -1 -1
#> m3_ATGGAATGAA 0 0 0.69897 5.0e+000 -1 -1
#> m4_CTTTMATRT 0 0 0.69897 5.0e+000 -1 -1
#> m5_RMSARCCAACGV 0 0 0.69897 5.0e+000 -1 -1
#> train_pos_count train_neg_count train_log_pval train_pval
#> m1_AGAGCAGHVA 8 0 -3.44705 0.00036
#> m2_CTCTTTCCCHTCT 8 0 -3.44705 0.00036
#> m3_ATGGAATGAA 7 0 -2.81023 0.0015
#> m4_CTTTMATRT 7 0 -2.81023 0.0015
#> m5_RMSARCCAACGV 7 0 -2.81023 0.0015
#> train_dtc train_bernoulli npassing is_palindromic elapsed_time
#> m1_AGAGCAGHVA -1 -1 8 no 0.3
#> m2_CTCTTTCCCHTCT -1 -1 8 no 0.1
#> m3_ATGGAATGAA -1 -1 7 no 0.4
#> m4_CTTTMATRT -1 -1 7 no 0.2
#> m5_RMSARCCAACGV -1 -1 7 no 0.5
#> total_sites
#> m1_AGAGCAGHVA 8
#> m2_CTCTTTCCCHTCT 8
#> m3_ATGGAATGAA 7
#> m4_CTTTMATRT 7
#> m5_RMSARCCAACGV 7
#> site_distr
#> m1_AGAGCAGHVA 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
#> m2_CTCTTTCCCHTCT 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> m3_ATGGAATGAA 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> m4_CTTTMATRT 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> m5_RMSARCCAACGV 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> max_sites site_hist
#> m1_AGAGCAGHVA 2 0 5 3
#> m2_CTCTTTCCCHTCT 1 0 8
#> m3_ATGGAATGAA 1 0 7
#> m4_CTTTMATRT 1 0 7
#> m5_RMSARCCAACGV 2 0 5 2
#>
#> [Hidden empty columns: family, organism, bkgsites, qval, eval.]
streme_results %>%
to_list() %>%
view_motifs()
Created on 2023-11-30 with reprex v2.0.2
Also, for some reason I can't reopen this issues, should I create a new one?
Hey @seb-mueller, I think this is actually an artifact of the rendering process for these motifs. For the ones where the logo doesn't match the IUPAC name, you'll see that they're shown as [RC]
(the m2_CTCTTTCCCHTCT
logo in the final example, for instance), and if you reverse-complement the displayed logo, you'll see it does indeed match the IUPAC.
Or did I misunderstand your issue?
Good spot, I didn't realize it could report it that way, at least that explains it for DNA Sequences. However I only constructed those for easy reproducibility, I spoted the actual issue with AA Sequences, where a reverse complement makes less sense. I only have a screenshot at hand, but happy to provide a reproducible example if needed. You can see that for instance m5 or m8 etc don't match up. Somehow it tries to do an RC which is not really working on AA sequences.
(streme_results <- runStreme(aaseqs, control = "shuffle", e = 50, alph = "protein", nmotifs = 10))
# motif name altname consensus alphabet strand icscore nsites pval type pseudocount bkg width initial_width seed score_threshold
# m1_PWRKNACT <mot:m1_P..> m1_PWRKNACT STREME-1 PWRKNACT AA +- 22.97021 18 0.24 PCM 0 0.0963, .... 8 6 PWRKNACT 0.480661
# m1m0_ICFWCPAN <mot:m2_N..> m2_NANANAVQD STREME-2 NANANAVQD AA +- 29.34581 8 0.50 PCM 0 0.0963, .... 9 6 NAAANAMQL 0.747902
# m2_NANANAVQD <mot:m3_F..> m3_FWIKLFNA STREME-3 FWIKLFNA AA +- 22.65608 20 1.00 PCM 0 0.0963, .... 8 4 FWIKLFNA 4.46709
# m3_FWIKLFNA <mot:m4_T..> m4_TNAYNAIC STREME-4 TNAYNAIC AA +- 20.52898 19 1.00 PCM 0 0.0963, .... 8 4 TNAYNAIC 6.10298
# m4_TNAYNAIC <mot:m5_D..> m5_DNASNWTI STREME-5 DNASNWTI AA +- 18.40615 12 1.00 PCM 0 0.0963, .... 8 5 DNASNWTI 14.1228
# m5_DNASNWTI <mot:m6_N..> m6_NALVHRBA STREME-6 NALVHRNA AA +- 22.68425 10 1.00 PCM 0 0.0963, .... 8 4 NALVHRDI 3.17051
# m6_NALVHRBA <mot:m7_N..> m7_NAPLHQRR STREME-7 NAPLHQRR AA +- 21.81291 10 1.00 PCM 0 0.0963, .... 8 5 NAPLHQRR 14.0807
# m7_NAPLHQRR <mot:m8_W..> m8_WNARHNFT STREME-8 WNARHNFT AA +- 22.90196 7 1.00 PCM 0 0.0963, .... 8 5 WNARHNST 12.9031
# m8_WNARHNFT <mot:m9_H..> m9_HQCSTFVH STREME-9 HQCSTFVH AA +- 26.52898 5 1.00 PCM 0 0.0963, .... 8 4 HQCSTLVG 0.909532
# m9_HQCSTFVH <mot:m1m0..> m1m0_ICFWCPAN STREME-10 ICFWCPAN AA +- 27.72584 5 1.00 PCM 0 0.0963, .... 8 4 ICFWSPFK 6.28344
# pos_count neg_count log_pval log_evalue evalue dtc bernoulli train_pos_count train_neg_count train_log_pval train_pval train_dtc train_bernoulli npassing
# m1_PWRKNACT 2 0 -0.625541 0.374459 2.4e+000 -1 0.5 16 0 -4.81563 1.5e-05 -1 0.500066 18
# m1m0_ICFWCPAN 1 0 -0.30103 0.69897 5.0e+000 -1 0.5 7 0 -2.10681 0.0078 -1 0.500066 8
# m2_NANANAVQD 0 0 0 1 1.0e+001 -1 0.5 20 0 -6.01954 9.6e-07 -1 0.500066 20
# m3_FWIKLFNA 0 0 0 1 1.0e+001 -1 0.5 19 0 -5.71857 1.9e-06 -1 0.500066 19
# m4_TNAYNAIC 0 0 0 1 1.0e+001 -1 0.5 12 0 -3.61167 0.00024 -1 0.500066 12
# m5_DNASNWTI 0 0 0 1 1.0e+001 -1 0.5 10 0 -3.00973 0.00098 -1 0.500066 10
# m6_NALVHRBA 0 0 0 1 1.0e+001 -1 0.5 10 0 -3.00973 0.00098 -1 0.500066 10
# m7_NAPLHQRR 0 0 0 1 1.0e+001 -1 0.5 7 0 -2.10681 0.0078 -1 0.500066 7
# m8_WNARHNFT 0 1 0 1 1.0e+001 -1 0.5 5 0 -1.50486 0.031 -1 0.500066 6
# m9_HQCSTFVH 0 0 0 1 1.0e+001 -1 0.5 5 0 -1.50486 0.031 -1 0.500066 5
# is_palindromic elapsed_time total_sites site_distr max_sites
# m1_PWRKNACT n/a 0.2 16 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 2
# m1m0_ICFWCPAN n/a 0.3 7 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 2
# m2_NANANAVQD n/a 0.4 20 0 0 1 0 0 0 0 0 1 1 2 0 1 0 1 1 1 1 0 0 0 0 0 2 0 0 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 2
# m3_FWIKLFNA n/a 0.1 19 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 2 0 1 0 0 0 1 0 1 1 2 0 0 0 0 0 1
# m4_TNAYNAIC n/a 0.1 12 0 0 1 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1
# m5_DNASNWTI n/a 0.5 10 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 2
# m6_NALVHRBA n/a 0.2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 1
# m7_NAPLHQRR n/a 0.3 7 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
# m8_WNARHNFT n/a 0.6 5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
# m9_HQCSTFVH n/a 0.5 5 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
# site_hist
# m1_PWRKNACT 0 14 2
# m1m0_ICFWCPAN 0 6 1
# m2_NANANAVQD 0 19 1
# m3_FWIKLFNA 0 19
# m4_TNAYNAIC 0 12
# m5_DNASNWTI 0 8 2
# m6_NALVHRBA 0 10
# m7_NAPLHQRR 0 7
# m8_WNARHNFT 0 5
# m9_HQCSTFVH 0 5
streme_results %>%
to_list() %>%
view_motifs()
Oh that's bizzare. Can you upload the .xml file output? You may need to rerun your code but set outdir
to a location you can access on your drive. This should have been fixed in https://github.com/snystrom/memes/commit/1a3eeefe50cf9ebe88139294a9a26fa88945a410.
Sure, I had to suffix a txt to the xml to be able to upload: streme.xml.txt Also, here is a screenshot from the html file:
Thanks! I'll take a look this evening.
Woah. @seb-mueller I found the issue. Bioconductor's upstream copy of memes
is not in sync with this repo. If you compare: https://code.bioconductor.org/browse/memes/blob/RELEASE_3_18/R/streme.R on bioc to the one from this repo: https://github.com/snystrom/memes/blob/master/R/streme.R you'll see clean_streme_ids()
is missing.... I swear I upstreamed these changes, but maybe I forgot??
Also observe the commit history at the bioc mirror: https://code.bioconductor.org/browse/memes/commits/RELEASE_3_18
Regardless, I'll try to get the repo polished up & merged w/ the bioc mirror. Sorry for the hiccup. In the mean time, I guess you can install the github version using remotes::install_github
. Ugh, what a mess.
Thanks so much @snystrom ! Nice and quickly solved, cheers!
Heads up, when using
runStreme
the names of motif objects within the output dataframe do not always match the actual identified motif.This doesn't seem to be the case for
runDreme
output.Environment:
R 4.2.2
memes 1.6.0