Closed huang-sh closed 2 years ago
Thank you for reporting this issue, and providing an excellent reproducible example! I think I know the problem & I'll try to push a fix in a couple hours.
Hey, so this is actually not a bug, but a subtle difference in how runTomTom behaves on different input data types.
Where possible, memes attempts to be endomorphic, that is, it tries to return the same data type as output that it received as input. So, when running runTomTom on a list, it will run a separate instance of the cli tomtom for each list entry, import those data, and return them as a list (the output list order matches the input list order). When run while setting the outdir argument on a list input, each subsequent call to tomtom as it runs on each list entry will overwrite the contents of the output directory, which is why your first example only shows the hits for the final list object.
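The overwrite can be reproduced without memes or the MEME suite. The following is a minimal base R sketch, not the memes implementation: fake_tomtom() is a hypothetical stand-in for a single tomtom call that writes one results file into outdir, and the motif names are made up.

```r
# Hypothetical stand-in for a single `tomtom` run: it writes one results
# file into `outdir`, replacing any file that is already there.
fake_tomtom <- function(motif_name, outdir) {
  dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
  out_file <- file.path(outdir, "tomtom.tsv")
  writeLines(paste("hits for", motif_name), out_file)
  invisible(out_file)
}

motifs <- c("motif_1", "motif_2", "motif_3", "motif_4")  # made-up names

# List-style input: one run per entry, all sharing one output directory.
shared_dir <- file.path(tempdir(), "shared_outdir")
for (m in motifs) fake_tomtom(m, shared_dir)
shared_result <- readLines(file.path(shared_dir, "tomtom.tsv"))

# One directory per entry keeps the file from every run.
per_entry_dirs <- file.path(tempdir(), "per_entry", motifs)
for (i in seq_along(motifs)) fake_tomtom(motifs[i], per_entry_dirs[i])
per_entry_result <- vapply(
  file.path(per_entry_dirs, "tomtom.tsv"), readLines, character(1),
  USE.NAMES = FALSE
)
```

Here shared_result holds only the file written by the last run, while per_entry_result holds one entry per motif. With the real package, the equivalent idea would presumably be to give each list entry its own outdir, or to convert the list to a single data.frame input before the call.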
If you inspect the results of your first example, you'll see it contains a list where each list entry is a universalmotif_df of the hits for each individual motif.
str(mmcmp, max.level = 1)
List of 4
$ :Classes ‘universalmotif_df’ and 'data.frame': 1 obs. of 27 variables:
$ :Classes ‘universalmotif_df’ and 'data.frame': 1 obs. of 19 variables:
$ :Classes ‘universalmotif_df’ and 'data.frame': 1 obs. of 19 variables:
$ :Classes ‘universalmotif_df’ and 'data.frame': 1 obs. of 19 variables:
You can generate a single combined data.frame with dplyr::bind_rows(), which should have all the TomTom metadata inside it.
dplyr::bind_rows(mmcmp)
# I truncated the data.frame for printing in this comment, but there's lots more data here!
motif name altname consensus alphabet strand icscore nsites eval type
1 <mot:1-AC..> 1-ACUUAC STREME-1 ACUUAC RNA + 9.081297 93 0.0013 PPM
2 <mot:2-CU..> 2-CUGCAGC STREME-2 CURCAGC RNA + 11.945943 164 0.0037 PPM
3 <mot:3-UU..> 3-UUUUUUUUUKUUU STREME-3 UUKUUKUUUKUUU RNA + 15.630323 121 0.0097 PPM
4 <mot:4-CU..> 4-CUUACCU STREME-4 CUUACCU RNA + 9.852394 411 0.0120 PPM
When run on a universalmotif data.frame or a path to a .meme format file, runTomTom will trigger a single run of the cli tomtom that uses all the information in the data.frame or .meme file. When triggered while setting outdir, the tomtom results files will therefore contain information for all hits in the input.
To demonstrate, compare your first example to the output of:
from_df <- runTomTom(
  # Here I use the universalmotif function to_df() to convert the list to `universalmotif_df`
  to_df(streme_motif),
  database = mm_motif,
  outdir = "tmp1",
  evalue = TRUE,
  silent = TRUE
)
For some memes functions, this distinction is important, since using list input allows you to design differential enrichment testing paradigms, and can therefore affect how the analysis is performed. For tomtom, using list vs data.frame input won't affect the results. So if you are using memes in order to use the raw tomtom outputs later, I suggest converting to a universalmotif_df first with to_df before calling runTomTom with outdir set.
I hope that clears things up. Happy to answer any other questions as well.
Again, thanks so much for providing a great example, it helps so much!
Cheers, -Spencer
Thanks for your clear answer! It helps me use memes better.
I have another question. From the software manual, I learned that the tomtom list column of the runTomTom return value stores the ranked list of possible matches to each motif. There are multiple matched motifs in tomtom.html and tomtom.tsv. Why is mmcmp$tomtom inconsistent with tomtom.tsv?
mmcmp <- memes::runTomTom(
  "streme.txt",
  database = mm_motif,
  outdir = "tmp2",
  evalue = TRUE,
  silent = TRUE
)
> mmcmp$tomtom
[[1]]
match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1 M001_0.6 (A1cf)_(Homo_sapiens)_(RBD_0.92) <S4 class ‘universalmotif’ [package “universalmotif”] with 20 slots> 1 2 0.0689 5.31 0.663 *
[[2]]
match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1 <NA> <NA> NULL <NA> NA NA NA NA <NA>
[[3]]
match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1 <NA> <NA> NULL <NA> NA NA NA NA <NA>
[[4]]
match_name match_altname match_motif db_name match_offset match_pval match_eval match_qval match_strand
1 <NA> <NA> NULL <NA> NA NA NA NA <NA>
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /home/huangsh/software/anaconda/envs/R41/lib/libopenblasp-r0.3.15.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] memes_1.0.4 magrittr_2.0.1 universalmotif_1.10.2
loaded via a namespace (and not attached):
[1] fgsea_1.18.0 colorspace_2.0-1 ggtree_3.0.2 ellipsis_0.3.2 rprojroot_2.0.2 qvalue_2.24.0 XVector_0.32.0 GenomicRanges_1.44.0
[9] aplot_0.0.6 rstudioapi_0.13 farver_2.1.0 graphlayouts_0.7.1 ggrepel_0.9.1 bit64_4.0.5 scatterpie_0.1.6 AnnotationDbi_1.54.1
[17] fansi_0.5.0 xml2_1.3.2 splines_4.1.0 R.methodsS3_1.8.1 cachem_1.0.5 GOSemSim_2.18.0 polyclip_1.10-0 pkgload_1.2.1
[25] jsonlite_1.7.2 GO.db_3.13.0 png_0.1-7 R.oo_1.24.0 ggforce_0.3.3 BiocManager_1.30.16 readr_1.4.0 compiler_4.1.0
[33] httr_1.4.2 rvcheck_0.1.8 lazyeval_0.2.2 assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0 cli_2.5.0 tweenr_1.0.2
[41] tools_4.1.0 igraph_1.2.6 gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.6 reshape2_1.4.4 DO.db_2.9 dplyr_1.0.7
[49] fastmatch_1.1-0 Rcpp_1.0.7 enrichplot_1.12.2 Biobase_2.52.0 vctrs_0.3.8 Biostrings_2.60.1 ape_5.5 nlme_3.1-152
[57] ggseqlogo_0.1 ggraph_2.0.5 stringr_1.4.0 ps_1.6.0 testthat_3.0.3 lifecycle_1.0.0 clusterProfiler_4.0.2 DOSE_3.18.1
[65] zlibbioc_1.38.0 MASS_7.3-54 scales_1.1.1 tidygraph_1.2.0 hms_1.1.0 parallel_4.1.0 RColorBrewer_1.1-2 yaml_2.2.1
[73] memoise_2.0.0 gridExtra_2.3 ggplot2_3.3.4 downloader_0.4 stringi_1.6.2 RSQLite_2.2.5 S4Vectors_0.30.0 desc_1.3.0
[81] tidytree_0.3.4 BiocGenerics_0.38.0 BiocParallel_1.26.0 GenomeInfoDb_1.28.0 rlang_0.4.11 pkgconfig_2.0.3 matrixStats_0.59.0 bitops_1.0-7
[89] lattice_0.20-44 purrr_0.3.4 treeio_1.16.1 patchwork_1.1.1 shadowtext_0.0.8 cowplot_1.1.1 bit_4.0.4 processx_3.5.2
[97] tidyselect_1.1.1 plyr_1.8.6 R6_2.5.0 IRanges_2.26.0 generics_0.1.0 DBI_1.1.1 pillar_1.6.1 withr_2.4.2
[105] KEGGREST_1.32.0 RCurl_1.98-1.3 tibble_3.1.2 cmdfun_1.0.2 crayon_1.4.1 utf8_1.2.1 viridis_0.6.1 grid_4.1.0
[113] data.table_1.14.0 blob_1.2.2 digest_0.6.27 tidyr_1.1.3 R.utils_2.10.1 stats4_4.1.0 munsell_0.5.0 viridisLite_0.4.0
Oh boy. This is a bug. Confirmed on my end also. This may have to do with some undocumented changes made to the MEME suite in the latest version... Either way, thank you for pointing it out!
This bug is now fixed in the development & release branches of memes (version >= 1.2.3). You can wait a few days for it to propagate through the Bioconductor system and reinstall with BiocManager::install("memes"), or you can get the development version from GitHub with:
remotes::install_github("snystrom/memes")
Thanks for reporting & helping to make the package better!
For future reference on my end, this bug was caused by incorrect use of the db XML flag in the query & target sections of the XML. The db flag actually refers to the "query database" when used in the query section, and the "target database" when used in the target section. Originally, memes reconstructed the hits by joining on the query_idx -> target_idx lookup table, while also including the db_idx entry. In simple cases, this never caused issues because db_idx is 0 for both query and targets, and in some of my tests using multiple databases they just happened to increment in sync and so worked fine. For tomtom runs that use multiple databases like in this example, joining on db_idx was unsyncing entries. I dropped db_idx from the join and that fixed the errors.
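To illustrate the failure mode, here is a small base R sketch with made-up tables. The column names and values are illustrative assumptions, not the actual fields memes parses from tomtom.xml, and merge() stands in for whatever join the package uses internally. The point is only that db_idx means different things on the query and target sides, so using it as a shared join key silently drops matches.

```r
# Made-up lookup tables; names and values are illustrative only.
queries <- data.frame(
  query_idx = c(0, 1),
  db_idx    = c(0, 0),       # index into the *query* databases
  query     = c("q_A", "q_B")
)
targets <- data.frame(
  target_idx = c(0, 1, 2),
  db_idx     = c(0, 1, 1),   # index into the *target* databases
  target     = c("t_X", "t_Y", "t_Z")
)
# Each row is one query -> target hit.
matches <- data.frame(query_idx = c(0, 0, 1), target_idx = c(0, 1, 2))

# Attach query information (this also carries the query-side db_idx along).
with_query <- merge(matches, queries, by = "query_idx")

# Buggy join: db_idx is treated as a key shared by queries and targets,
# so hits whose target lives in database 1 no longer line up.
buggy <- merge(with_query, targets, by = c("target_idx", "db_idx"))

# Fixed join: match on the index column only and leave db_idx out.
fixed <- merge(
  with_query[, c("query_idx", "target_idx", "query")],
  targets[, c("target_idx", "target")],
  by = "target_idx"
)
```

In this toy case buggy keeps only the hit whose target happens to sit in database 0, while fixed keeps all three hits, which resembles the symptom above where only the first motif reported a match.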
Thanks for your explanation! It works well with the development version.
Hi, thank you for providing the memes package.
I ran runTomTom with a universalmotif list as input, and the output files (tomtom.html, tomtom.xml, tomtom.tsv) only include the last motif comparison result. For example:
There is only one motif comparison result in tmp1/tomtom.tsv:
However, I get all the results when I input a file path.
In tmp2/tomtom.tsv:
streme.txt Mus_musculus.txt