snystrom / memes

An R interface to the MEME Suite
https://snystrom.github.io/memes/
Other
43 stars 5 forks source link

AME failed with input universalmotif list #101

Closed mikstapes closed 1 year ago

mikstapes commented 2 years ago

Hi,

I encountered an error running AME and would really appreciate any suggestions. As input, I have GRanges split into a BStringSetList by an mcol, and then run AME on a universalmotif list like so:

pks_by_seq <- resize(pks_mm39, width = 200, fix = 'center') %>% 
    split(mcols(.)$seq_clss) %>% 
    get_sequence(mm39_genome)

jaspar_motifs <- read_jaspar(here("motifs",
                                                  "JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt"))

test_ame <- runAme(pks_by_seq, database = jaspar_motifs)

I figured this was quite straightforward, but I can't seem to get the function running and check_meme_install() ran fine. Here's the traceback:

Loading primary sequences.
Creating control sequences by shuffling input sequences preserving 2-mers.
Not in partition maximization mode. Fixing partition at the number of primary sequences (1568).
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
FATAL: Template does not contain data section.

Error: Shell process had non-zero exit status.
Traceback:

1. runAme(pks_by_seq, database = heart_motifs)
2. runAme.BStringSetList(pks_by_seq, database = jaspar_motifs)
3. runAme.list(as.list(input), control, outdir, method, database, 
 .     meme_path, sequences, silent, ...)
4. purrr::map(input, runAme.default, control = control, outdir = outdir, 
 .     method = method, database = database, meme_path = meme_path, 
 .     sequences = sequences, silent = silent, ...)
5. .f(.x[[i]], ...)
6. ps_out %>% process_check_error(help_fun = ~{
 .     ame_help(command)
 . }, user_flags = cmdfun::cmd_help_parse_flags(user_flags) %>% 
 .     grep("shuffle", ., invert = TRUE, value = TRUE), flags_fun = ~{
 .     gsub("-", "_", .x)
 . }, default_help_fun = TRUE)
7. process_check_error(., help_fun = ~{
 .     ame_help(command)
 . }, user_flags = cmdfun::cmd_help_parse_flags(user_flags) %>% 
 .     grep("shuffle", ., invert = TRUE, value = TRUE), flags_fun = ~{
 .     gsub("-", "_", .x)
 . }, default_help_fun = TRUE)
8. usethis::ui_stop("Shell process had non-zero exit status.")

and my session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: MarIuX64 2.0 GNU/Linux

Matrix products: default
BLAS/LAPACK: ~/miniconda3/envs/r4_env/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm39_1.4.3 BSgenome_1.62.0                   
 [3] seqPattern_1.26.0                  MotifDb_1.36.0                    
 [5] Biostrings_2.62.0                  XVector_0.34.0                    
 [7] universalmotif_1.12.4              memes_1.2.5                       
 [9] plyranges_1.14.0                   rtracklayer_1.54.0                
[11] GenomicFeatures_1.46.5             AnnotationDbi_1.56.2              
[13] Biobase_2.54.0                     GenomicRanges_1.46.1              
[15] GenomeInfoDb_1.30.1                IRanges_2.28.0                    
[17] S4Vectors_0.32.3                   BiocGenerics_0.40.0               
[19] effsize_0.8.1                      patchwork_1.1.1                   
[21] forcats_0.5.1                      stringr_1.4.0                     
[23] dplyr_1.0.8                        purrr_0.3.4                       
[25] readr_2.1.2                        tidyr_1.2.0                       
[27] tibble_3.1.6                       ggplot2_3.3.5                     
[29] tidyverse_1.3.1                    here_1.0.1                        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                uuid_1.0-3                 
  [3] backports_1.4.1             BiocFileCache_2.2.1        
  [5] repr_1.1.4                  splines_4.1.1              
  [7] BiocParallel_1.28.3         usethis_2.1.5              
  [9] digest_0.6.29               htmltools_0.5.2            
 [11] fansi_1.0.2                 magrittr_2.0.2             
 [13] memoise_2.0.1               tzdb_0.2.0                 
 [15] modelr_0.1.8                matrixStats_0.61.0         
 [17] R.utils_2.11.0              vroom_1.5.7                
 [19] prettyunits_1.1.1           colorspace_2.0-3           
 [21] blob_1.2.2                  rvest_1.0.2                
 [23] rappdirs_0.3.3              haven_2.4.3                
 [25] crayon_1.5.0                RCurl_1.98-1.6             
 [27] jsonlite_1.8.0              glue_1.6.2                 
 [29] gtable_0.3.0                zlibbioc_1.40.0            
 [31] DelayedArray_0.20.0         scales_1.1.1               
 [33] DBI_1.1.2                   Rcpp_1.0.8.2               
 [35] plotrix_3.8-2               progress_1.2.2             
 [37] bit_4.0.4                   httr_1.4.2                 
 [39] ellipsis_0.3.2              R.methodsS3_1.8.1          
 [41] pkgconfig_2.0.3             XML_3.99-0.9               
 [43] farver_2.1.0                ggseqlogo_0.1              
 [45] dbplyr_2.1.1                utf8_1.2.2                 
 [47] tidyselect_1.1.2            labeling_0.4.2             
 [49] rlang_1.0.2                 munsell_0.5.0              
 [51] cellranger_1.1.0            tools_4.1.1                
 [53] cachem_1.0.6                cli_3.2.0                  
 [55] generics_0.1.2              RSQLite_2.2.10             
 [57] broom_0.7.12                evaluate_0.15              
 [59] fastmap_1.1.0               yaml_2.3.5                 
 [61] processx_3.5.2              bit64_4.0.5                
 [63] fs_1.5.2                    KEGGREST_1.34.0            
 [65] splitstackshape_1.4.8       nlme_3.1-155               
 [67] R.oo_1.24.0                 xml2_1.3.3                 
 [69] biomaRt_2.50.3              brio_1.1.3                 
 [71] compiler_4.1.1              rstudioapi_0.13            
 [73] filelock_1.0.2              curl_4.3.2                 
 [75] png_0.1-7                   testthat_3.1.2             
 [77] reprex_2.0.1                stringi_1.7.6              
 [79] ps_1.6.0                    desc_1.4.1                 
 [81] lattice_0.20-45             IRdisplay_1.1              
 [83] Matrix_1.4-0                vctrs_0.3.8                
 [85] pillar_1.7.0                lifecycle_1.0.1            
 [87] data.table_1.14.2           bitops_1.0-7               
 [89] R6_2.5.1                    BiocIO_1.4.0               
 [91] KernSmooth_2.23-20          MASS_7.3-55                
 [93] assertthat_0.2.1            pkgload_1.2.4              
 [95] SummarizedExperiment_1.24.0 rprojroot_2.0.2            
 [97] rjson_0.2.21                withr_2.5.0                
 [99] GenomicAlignments_1.30.0    Rsamtools_2.10.0           
[101] GenomeInfoDbData_1.2.7      mgcv_1.8-39                
[103] hms_1.1.1                   grid_4.1.1                 
[105] cmdfun_1.0.2                IRkernel_1.3.0.9000        
[107] MatrixGenerics_1.6.0        pbdZMQ_0.3-7               
[109] lubridate_1.8.0             base64enc_0.1-3            
[111] restfulr_0.0.13            
snystrom commented 2 years ago

Hey I'll take a look at this soon, could you share a subset of your sequence data to help reproduce the issue? You can provide it to me privately at my email (listed in the DESCRIPTION file) if you don't want to share any of it publicly, or attach a subset to this GitHub issue. Ideally an example that includes all the different values of seq_clss.

May be able to reproduce without it but if not, that will be important for finding a fix.

mikstapes commented 2 years ago

hey thanks for the quick reply! I took a 10 seqs per seq_clss subset and saved it as an.rds file. You can find it here (it's a personal dropbox link). Looking forward to hearing what you'll find out!

snystrom commented 2 years ago

Thank you so much, that's going to save a ton of time. I'll get to it this evening.

snystrom commented 2 years ago

Which version of the meme suite are you running? ame --version on the commandline.

mikstapes commented 2 years ago

I'm running 5.3.3

On Tue 24. May 2022 at 4:54 PM, Spencer Nystrom @.***> wrote:

Which version of the meme suite are you running? ame --version on the commandline.

— Reply to this email directly, view it on GitHub https://github.com/snystrom/memes/issues/101#issuecomment-1136035247, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJ6QY3WX7ZFMVDKNWEA24E3VLTUQZANCNFSM5WRSNMJA . You are receiving this because you authored the thread.Message ID: @.***>

-- Mai Phan

snystrom commented 2 years ago

Sorry for the delay on fixing this. I cannot reproduce this bug so far on my system running ame 5.3.3 or other versions of the meme suite using the test set of peaks you sent. Are the names of the groups the same in the example vs the real set you are using?

Are these the names you're using?

> names(pks_by_seq)
[1] "DC" "IC" "NC"

Also, can you confirm what jaspar DB file you are using? I used the one here (md5sum: 049663c08de5ebed565f82385b517342).

snystrom commented 2 years ago

To help me debug, could you rerun as follows:

test_ame <- runAme(pks_by_seq, database = jaspar_motifs, silent = FALSE)

And post the output?

snystrom commented 1 year ago

I'm going to close this issue, but feel free to reopen if you still have it & can provide more information.

NanoCoreUSA commented 1 year ago

I just wanted to report that I recently ran into this exact same error so the issue should likely be reopened -

hs.genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38

peaks <- readPeakFile("test.bed")

sequences <- peaks %>% 
     get_sequence(hs.genome)

ame_results <- runAme(sequences,
     database = "motif_databases.12.23/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant_v2.meme",
     control = "shuffle", 
     evalue_report_threshold = 30)

The test BED file only contains one line (wanted to make sure it wasn't a specific coordinate or something) -

chr1 5916170 5916210

I also tried multiple motif databases, including the one @snystrom linked above.

Error -

Added motif_databases.12.23/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant_v2.meme to motif_sources which now has 1 file names.
Motif file name is motif_databases.12.23/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant_v2.meme.
Writing results to output directory '/tmp/Rtmph6f4z3/file274aa4f83f04f_vs_shuffle'.
E-value threshold for reporting results: 30
Checking alphabets in 1 motif files.
Loading motifs from file 'motif_databases.12.23/JASPAR/JASPAR2022_CORE_vertebrates_non-redundant_v2.meme'
Loading primary sequences.
Creating control sequences by shuffling input sequences preserving 2-mers.
Not in partition maximization mode. Fixing partition at the number of primary sequences (1).
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
Bad file name.
FATAL: Template does not contain data section.

Error: Shell process had non-zero exit status.

Any suggestions?

snystrom commented 1 year ago

How did you install the meme suite? Are you using the conda version? This is a known issue with the conda version.

https://github.com/bioconda/bioconda-recipes/issues/22909