snystrom / memes

An R interface to the MEME Suite
https://snystrom.github.io/memes/
Other
43 stars 5 forks source link

runTomTom only output the last motif comparision result when input is a list of universalmotifs #92

Closed huang-sh closed 2 years ago

huang-sh commented 2 years ago

hi, thank you for providing the memes package.

I run runTomTom with universalmotifs list input and the output (tomtom.html, tomtom.xml, tomtom.tsv) only include the last motif comparision result. For example:

library(universalmotif)

mm_motif <- read_meme("Mus_musculus.meme")
streme_motif <- read_meme("streme.txt")

mmcmp <- memes::runTomTom(
    streme_motif,
    database = mm_motif,
    outdir = "tmp1",
    evalue = TRUE,
    silent = TRUE
    )

There is one motif comparison result in the tmp1/tomtom.tsv :

Query_ID    Target_ID   Optimal_offset  p-value E-value q-value Overlap Query_consensus Target_consensus    Orientation
4-CUUACCU   M207_0.6    1   0.00508867  0.391828    0.150498    7   CUUACCU CCUUUCCC    +
4-CUUACCU   M051_0.6    -1  0.00578618  0.445536    0.150498    6   CUUACCU AUACAUU +
4-CUUACCU   M188_0.6    1   0.0061343   0.472341    0.150498    7   CUUACCU CUUUCCCU    +
4-CUUACCU   M043_0.6    1   0.00781807  0.601991    0.150498    6   CUUACCU CCUUCCC +
4-CUUACCU   M046_0.6    1   0.0105168   0.809794    0.161959    6   CUUACCU ACUAACA +
4-CUUACCU   M227_0.6    0   0.0492017   3.78853 0.62855 7   CUUACCU CUUUUCU +
4-CUUACCU   M037_0.6    2   0.0571409   4.39985 0.62855 5   CUUACCU AGCUUGC +
4-CUUACCU   M026_0.6    1   0.0801958   6.17508 0.771885    6   CUUACCU CCAACCC +
4-CUUACCU   M036_0.6    -2  0.116329    8.95736 0.810544    5   CUUACCU AAUCUUG +

# Tomtom (Motif Comparison Tool): Version 5.3.3 compiled on Feb 28 2021 at 22:14:02
# The format of this file is described at https://meme-suite.org/meme/doc/tomtom-output-format.html.

However, I can get all result when I input a file path.

mmcmp <- memes::runTomTom(
    "streme.txt",
    database = mm_motif,
    outdir = "tmp2",
    evalue = TRUE,
    silent = TRUE
    )

in the tmp2/tomtom.tsv:

Query_ID    Target_ID   Optimal_offset  p-value E-value q-value Overlap Query_consensus Target_consensus    Orientation
1-ACUUAC    M046_0.6    0   0.00178624  0.13754 0.13754 6   ACUUAC  ACUAACA +
1-ACUUAC    M138_0.6    1   0.0109687   0.844593    0.422296    6   ACUUAC  AACUAAG +
1-ACUUAC    M085_0.6    1   0.0190601   1.46762 0.489208    6   ACUUAC  GAAUUAAG    +
1-ACUUAC    M037_0.6    1   0.0276895   2.13209 0.520927    6   ACUUAC  AGCUUGC +
1-ACUUAC    M169_0.6    0   0.0353363   2.72089 0.520927    6   ACUUAC  ACACACA +
1-ACUUAC    M051_0.6    2   0.0405917   3.12556 0.520927    5   ACUUAC  AUACAUU +
1-ACUUAC    M207_0.6    0   0.0665787   5.12656 0.663186    6   ACUUAC  CCUUUCCC    +
1-ACUUAC    M001_0.6    2   0.0689024   5.30549 0.663186    5   ACUUAC  AUAAUUG +
1-ACUUAC    M027_0.6    0   0.0917848   7.06743 0.704416    6   ACUUAC  ACACACA +
1-ACUUAC    M036_0.6    2   0.0917848   7.06743 0.704416    5   ACUUAC  AAUCUUG +
1-ACUUAC    M043_0.6    0   0.122124    9.40356 0.704416    6   ACUUAC  CCUUCCC +
1-ACUUAC    M002_0.6    2   0.122124    9.40356 0.704416    5   ACUUAC  AGACGUA +
2-CUGCAGC   M083_0.6    -2  0.00822484  0.633313    0.633313    5   CUGCAGC GCAGCGC +
2-CUGCAGC   M069_0.6    0   0.060873    4.68722 0.999915    7   CUGCAGC UUGCACA +
2-CUGCAGC   M140_0.6    1   0.106402    8.19297 0.999915    6   CUGCAGC CAGACAG +
2-CUGCAGC   M286_0.6    -1  0.125096    9.63241 0.999915    4   CUGCAGC UGUA    +
2-CUGCAGC   M056_0.6    -2  0.127204    9.79471 0.999915    5   CUGCAGC ACAACGA +
3-UUUUUUUUUKUUU M031_0.6    -6  3.31782e-05 0.00255472  0.00255472  7   UUUUUUUUUUUUU   UUAUUUU +
3-UUUUUUUUUKUUU M075_0.6    -4  0.000555323 0.0427599   0.0213799   7   UUUUUUUUUUUUU   UUUUUUG +
3-UUUUUUUUUKUUU M025_0.6    -6  0.00133759  0.102994    0.0343315   7   UUUUUUUUUUUUU   AUUUUUU +
3-UUUUUUUUUKUUU M227_0.6    -4  0.0022515   0.173365    0.0372784   7   UUUUUUUUUUUUU   CUUUUCU +
3-UUUUUUUUUKUUU M077_0.6    -4  0.00242067  0.186392    0.0372784   7   UUUUUUUUUUUUU   UUUUUUC +
3-UUUUUUUUUKUUU M012_0.6    -3  0.00300139  0.231107    0.0385178   7   UUUUUUUUUUUUU   CUUUUUU +
3-UUUUUUUUUKUUU M004_0.6    -4  0.0220431   1.69732 0.242474    7   UUUUUUUUUUUUU   UGUGUGU +
3-UUUUUUUUUKUUU M051_0.6    -6  0.108586    8.36111 0.929013    7   UUUUUUUUUUUUU   AUACAUU +
3-UUUUUUUUUKUUU M168_0.6    -2  0.108586    8.36111 0.929013    7   UUUUUUUUUUUUU   GUAGUGU +
4-CUUACCU   M207_0.6    1   0.00508867  0.391828    0.150498    7   CUUACCU CCUUUCCC    +
4-CUUACCU   M051_0.6    -1  0.00578618  0.445536    0.150498    6   CUUACCU AUACAUU +
4-CUUACCU   M188_0.6    1   0.0061343   0.472341    0.150498    7   CUUACCU CUUUCCCU    +
4-CUUACCU   M043_0.6    1   0.00781807  0.601991    0.150498    6   CUUACCU CCUUCCC +
4-CUUACCU   M046_0.6    1   0.0105168   0.809794    0.161959    6   CUUACCU ACUAACA +
4-CUUACCU   M227_0.6    0   0.0492017   3.78853 0.62855 7   CUUACCU CUUUUCU +
4-CUUACCU   M037_0.6    2   0.0571409   4.39985 0.62855 5   CUUACCU AGCUUGC +
4-CUUACCU   M026_0.6    1   0.0801958   6.17508 0.771885    6   CUUACCU CCAACCC +
4-CUUACCU   M036_0.6    -2  0.116329    8.95736 0.810544    5   CUUACCU AAUCUUG +

# Tomtom (Motif Comparison Tool): Version 5.3.3 compiled on Feb 28 2021 at 22:14:02
# The format of this file is described at https://meme-suite.org/meme/doc/tomtom-output-format.html.

streme.txt Mus_musculus.txt

snystrom commented 2 years ago

Thank you for reporting this issue, and providing an excellent reproducible example! I think I know the problem & I'll try to push a fix in a couple hours.

snystrom commented 2 years ago

Hey, so this is actually not a bug, but a subtle difference in how runTomTom behaves on different input data types.

Where possible, memes attempts to be endomorphic, that is, it tries to return the same data type as output that it received as input. So, when running runTomTom on a list, it will run a separate instance of the cli tomtom for each list entry, import those data, and return them as a list (the output list order matches the input list order). When run while setting the outdir argument on a list input, each subsequent call to tomtom as it runs on each list entry will overwrite the contents of the output directory, which is why your first example only shows the hits for the final list object.

If you inpsect the results of your first example, you'll see it contains a list where each list entry is a univeralmotif_df of the hits for each individual motif.

str(mmcmp, max.level = 1)
List of 4
 $ :Classes ‘universalmotif_df’ and 'data.frame':   1 obs. of  27 variables:
 $ :Classes ‘universalmotif_df’ and 'data.frame':   1 obs. of  19 variables:
 $ :Classes ‘universalmotif_df’ and 'data.frame':   1 obs. of  19 variables:
 $ :Classes ‘universalmotif_df’ and 'data.frame':   1 obs. of  19 variables:

You can generate a resulting data.frame with dplyr::bind_rows() that should have all the TomTom metadata inside it.

dplyr::bind_rows(mmcmp)

# I truncated the data.frame for printing in this comment, but there's lots more data here!

         motif            name  altname     consensus alphabet strand   icscore nsites   eval type
1 <mot:1-AC..>        1-ACUUAC STREME-1        ACUUAC      RNA      +  9.081297     93 0.0013  PPM
2 <mot:2-CU..>       2-CUGCAGC STREME-2       CURCAGC      RNA      + 11.945943    164 0.0037  PPM
3 <mot:3-UU..> 3-UUUUUUUUUKUUU STREME-3 UUKUUKUUUKUUU      RNA      + 15.630323    121 0.0097  PPM
4 <mot:4-CU..>       4-CUUACCU STREME-4       CUUACCU      RNA      +  9.852394    411 0.0120  PPM

When run on a universalmotif data.frame or a path to a .meme format file, runTomTom will trigger a single run of the cli tomtom that uses all the information in the data.frame or .meme file. When triggered while setting outdir, the tomtom results files will therefore contain information for all hits in the input.

To demonstrate, compare your first example to the output of:

from_df <- runTomTom(
    # Here I use the universalmotif function to_df() to convert the list to `universalmotif_df`
    to_df(streme_motif),
    database = mm_motif,
    outdir = "tmp1",
    evalue = TRUE,
    silent = TRUE
    )

For some memes functions, this distinction is important, since using list input allows you to design differential enrichment testing paradigms, and can therefore affect how the analysis is performed. For tomtom, it won't affect the results to use list vs data.frame input. So if you are using memes in order to use the raw tomtom outputs later, I suggest converting to a universalmotif_df first with to_df before runTomTom with the outdir set.

I hope that clears things up. Happy to answer any other questions as well.

Again, thanks so much for providing a great example, it helps so much!

Cheers, -Spencer

huang-sh commented 2 years ago

Thanks for your clear answer! It helps me use memes better.

There is anther question. In the software manual, I learned that runTomTom return value' tomtom list column stores the ranked list of possible matches to each motif. There are multiple matched motifs in tomtom.html or tomtom.tsv. Why mmcmp$tomtom is inconsistent with tomtom.tsv.

mmcmp <- memes::runTomTom(
    "streme.txt",
    database = mm_motif,
    outdir = "tmp2",
    evalue = TRUE,
    silent = TRUE
    )

> mmcmp$tomtom                                                                                                                                                                                               
[[1]]
  match_name                    match_altname                                                          match_motif db_name match_offset match_pval match_eval match_qval match_strand
1   M001_0.6 (A1cf)_(Homo_sapiens)_(RBD_0.92) <S4 class ‘universalmotif’ [package “universalmotif”] with 20 slots>       1            2     0.0689       5.31      0.663            *

[[2]]
  match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1       <NA>          <NA>        NULL    <NA>           NA         NA         NA         NA         <NA>

[[3]]
  match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1       <NA>          <NA>        NULL    <NA>           NA         NA         NA         NA         <NA>

[[4]]
  match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1       <NA>          <NA>        NULL    <NA>           NA         NA         NA         NA         <NA>
> sessionInfo()                                                                                                                                                                                              
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/huangsh/software/anaconda/envs/R41/lib/libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] memes_1.0.4           magrittr_2.0.1        universalmotif_1.10.2

loaded via a namespace (and not attached):
  [1] fgsea_1.18.0           colorspace_2.0-1       ggtree_3.0.2           ellipsis_0.3.2         rprojroot_2.0.2        qvalue_2.24.0          XVector_0.32.0         GenomicRanges_1.44.0  
  [9] aplot_0.0.6            rstudioapi_0.13        farver_2.1.0           graphlayouts_0.7.1     ggrepel_0.9.1          bit64_4.0.5            scatterpie_0.1.6       AnnotationDbi_1.54.1  
 [17] fansi_0.5.0            xml2_1.3.2             splines_4.1.0          R.methodsS3_1.8.1      cachem_1.0.5           GOSemSim_2.18.0        polyclip_1.10-0        pkgload_1.2.1         
 [25] jsonlite_1.7.2         GO.db_3.13.0           png_0.1-7              R.oo_1.24.0            ggforce_0.3.3          BiocManager_1.30.16    readr_1.4.0            compiler_4.1.0        
 [33] httr_1.4.2             rvcheck_0.1.8          lazyeval_0.2.2         assertthat_0.2.1       Matrix_1.3-4           fastmap_1.1.0          cli_2.5.0              tweenr_1.0.2          
 [41] tools_4.1.0            igraph_1.2.6           gtable_0.3.0           glue_1.4.2             GenomeInfoDbData_1.2.6 reshape2_1.4.4         DO.db_2.9              dplyr_1.0.7           
 [49] fastmatch_1.1-0        Rcpp_1.0.7             enrichplot_1.12.2      Biobase_2.52.0         vctrs_0.3.8            Biostrings_2.60.1      ape_5.5                nlme_3.1-152          
 [57] ggseqlogo_0.1          ggraph_2.0.5           stringr_1.4.0          ps_1.6.0               testthat_3.0.3         lifecycle_1.0.0        clusterProfiler_4.0.2  DOSE_3.18.1           
 [65] zlibbioc_1.38.0        MASS_7.3-54            scales_1.1.1           tidygraph_1.2.0        hms_1.1.0              parallel_4.1.0         RColorBrewer_1.1-2     yaml_2.2.1            
 [73] memoise_2.0.0          gridExtra_2.3          ggplot2_3.3.4          downloader_0.4         stringi_1.6.2          RSQLite_2.2.5          S4Vectors_0.30.0       desc_1.3.0            
 [81] tidytree_0.3.4         BiocGenerics_0.38.0    BiocParallel_1.26.0    GenomeInfoDb_1.28.0    rlang_0.4.11           pkgconfig_2.0.3        matrixStats_0.59.0     bitops_1.0-7          
 [89] lattice_0.20-44        purrr_0.3.4            treeio_1.16.1          patchwork_1.1.1        shadowtext_0.0.8       cowplot_1.1.1          bit_4.0.4              processx_3.5.2        
 [97] tidyselect_1.1.1       plyr_1.8.6             R6_2.5.0               IRanges_2.26.0         generics_0.1.0         DBI_1.1.1              pillar_1.6.1           withr_2.4.2           
[105] KEGGREST_1.32.0        RCurl_1.98-1.3         tibble_3.1.2           cmdfun_1.0.2           crayon_1.4.1           utf8_1.2.1             viridis_0.6.1          grid_4.1.0            
[113] data.table_1.14.0      blob_1.2.2             digest_0.6.27          tidyr_1.1.3            R.utils_2.10.1         stats4_4.1.0           munsell_0.5.0          viridisLite_0.4.0 
snystrom commented 2 years ago

Oh boy. This is a bug. Confirmed on my end also. This may have to do with some undocumented changes made to the MEME suite in the latest version... Either way, thank you for pointing it out!

snystrom commented 2 years ago

This bug is now fixed in the development & release branch of memes (version >= 1.2.3). You can wait for it to propagate on the bioconductor system in a few days and reinstall with biocManager::install("memes"), or you can get the development version from github with:

remotes::install_github("snystrom/memes").

Thanks for reporting & helping to make the package better!

snystrom commented 2 years ago

For future reference on my end, this bug was caused by incorrect use of the db XML flag in the query & target sections of the XML. The db flag actually referrs to the "query database" when used in the query section, and the "target database" when used in the target section. Originally, memes reconstructed the hits by joining on the query_idx -> target_idx lookup table, while also including the db_idx entry. In simple cases, this never caused issues because db_idx is 0 for both query and targets, and in some of my tests using multiple databases they just happened to increment in sync and so worked fine. For tomtom runs that use multiple databases like in this example, joining on db_idx was unsyncing entries. I just dropped this and it fixed the errors.

huang-sh commented 2 years ago

Thanks for your explanation! It works well with the development version.