stephenturner / biorecap

Retrieve and summarize bioRxiv preprints with a local LLM using ollama
https://stephenturner.github.io/biorecap/

Error in parsing ollamar::generate with ollamar >= 1.2.1 #1

Closed huyvuong closed 3 weeks ago

huyvuong commented 3 weeks ago

Thanks for developing a very helpful package. I got the following error when trying to run:

pp <- get_preprints(subject=c("bioinformatics", "genomics", "synthetic_biology")) |> add_prompt() |> add_summary(model="llama2:latest")

Error in dplyr::mutate():
ℹ In argument: summary = as.vector(...).
Caused by error in ollamar::generate(model = model, prompt = x, output = "text")$response:
! $ operator is invalid for atomic vectors
Run rlang::last_trace() to see where the error occurred.

When I run

str(ollamar::generate(model = "llama2:latest", prompt = "I am giving you a paper's title and abstract. Summarize the paper in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence.\nNumber of sentences in summary: 2\nTitle: PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings\nAbstract: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity. We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp. The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.", output = "text"))

I got a character vector returned:

chr "Sure, here is a summary of the paper \"PHIStruct: Improving phage-host interaction prediction at low sequence s"| __truncated__

I believe the error occurs because the return value of ollamar::generate() is a character vector, not a list, so $response cannot be applied to it.
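
The diagnosis above can be reproduced without an ollama server. The following is a minimal sketch: `as_text` and `as_table` are hypothetical stand-in values that only mimic the two return shapes seen in this thread, not real model output.

```r
# Stand-ins for the two return shapes (hypothetical values, not model output).
as_text  <- "Sure, here is a summary ..."          # plain character vector
as_table <- data.frame(model = "llama2:latest",
                       response = "Sure, here is a summary ...")

# `$` works on lists and data frames ...
as_table$response

# ... but fails on an atomic vector; capture the error message instead of stopping.
err <- tryCatch(as_text$response, error = function(e) conditionMessage(e))
err
```

The captured message is the same "$ operator is invalid for atomic vectors" error that appears in the report.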

sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ollamar_1.2.1.9000 biorecap_0.1.0

loaded via a namespace (and not attached):
[1] jsonlite_1.8.8 dplyr_1.1.4 compiler_4.3.1 crayon_1.5.3 Rcpp_1.0.13 tidyselect_1.2.1
[7] xml2_1.3.6 callr_3.7.6 jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0 R6_2.5.1
[13] generics_0.1.3 curl_5.0.2 httr2_1.0.3 knitr_1.48 tibble_3.2.1 desc_1.4.3
[19] bslib_0.8.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4 cachem_1.1.0 xfun_0.47
[25] sass_0.4.9 cli_3.6.3 withr_3.0.1 magrittr_2.0.3 ps_1.7.7 digest_0.6.37
[31] processx_3.8.4 rstudioapi_0.15.0 remotes_2.4.2.1 rappdirs_0.3.3 anytime_0.3.9 lifecycle_1.0.4
[37] tidyRSS_2.0.7 vctrs_0.6.5 evaluate_0.24.0 glue_1.7.0 pkgbuild_1.4.4 fansi_1.0.6
[43] purrr_1.0.2 httr_1.4.7 rmarkdown_2.28 tools_4.3.1 pkgconfig_2.0.3 htmltools_0.5.8.1

stephenturner commented 3 weeks ago

Hmm... it's working for me when I run it today.

pp <-
  get_preprints(subject=c("bioinformatics", "genomics", "synthetic_biology")) |>
  add_prompt() |>
  add_summary(model="llama2:latest")
> pp
# A tibble: 90 × 6
   subject        title                                                                                                                         url   abstract prompt summary
   <chr>          <chr>                                                                                                                         <chr> <chr>    <chr>  <chr>  
 1 bioinformatics "PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein em… http… Recent … "I am… "Sure!…
 2 bioinformatics "BGC Atlas: A Web Resource for Exploring the Global Chemical Diversity Encoded in Bacterial Genomes"                          http… Seconda… "I am… "Sure!…
 3 bioinformatics "Enrichment analysis for spatial and single-cell metabolomics accounting for molecular ambiguity"                             http… Imaging… "I am… "Sure!…
 4 bioinformatics "SpatialLeiden - Spatially-aware Leiden clustering"                                                                           http… Cluster… "I am… "Sure!…
 5 bioinformatics "Protein sequence classification using natural language processing techniques"                                                http… Protein… "I am… "1. Th…
 6 bioinformatics "A multi-omics approach to identify deleterious mutations in plants"                                                          http… Crops l… "I am… "Here …
 7 bioinformatics "Automatic crystal identification for crystallography: a comparison between direct methods and artificial intelligence strat… http… Crystal… "I am… "Sure!…
 8 bioinformatics "Solving the \"Blind men and the elephant problem\": Additive deep learning of complex high dimensional models from partial … http… Biologi… "I am… "Sure!…
 9 bioinformatics "Evolutionary mismatch between nuclear and mitochondrial genomes does not promote reversion mutations in mtDNA."              http… Serrano… "I am… "Sure!…
10 bioinformatics "Predicting small-molecule inhibition of protein complexes"                                                                   http… Protein… "I am… "Here …
# ℹ 80 more rows
# ℹ Use `print(n = ...)` to see more rows

When I run it with this specific example, there's no issue:

pp <- tibble::tibble(title="PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings", abstract="Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity. We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp. The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.")
pp <- pp |> add_prompt() |> add_summary()
pp$summary
[1] "The PHIStruct model takes in structure-aware embeddings of receptor-binding proteins generated via the SaProt language model to predict the host from among the ESKAPEE genera. The model exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings compared to recent tools."
pp$prompt
[1] "I am giving you a paper's title and abstract. Summarize the paper in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence. Number of sentences in summary: 2 Title: PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings Abstract: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity. We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7% to 9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5% to 6% increase over BLASTp. The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct."

Let's see what happens with this prompt:

theprompt <- pp$prompt
modelres <- ollamar::generate(model="llama2:latest", prompt = theprompt, output="text")
class(modelres)
[1] "tbl_df"     "tbl"        "data.frame"
> modelres
# A tibble: 1 × 3
  model         response                                                                                                                                           created_at
  <chr>         <chr>                                                                                                                                              <chr>     
1 llama2:latest "1. PHIStruct is a machine learning model that predicts the host of a phage based on its receptor-binding proteins, using structure-aware protein… 2024-08-2…
> modelres$response
[1] "1. PHIStruct is a machine learning model that predicts the host of a phage based on its receptor-binding proteins, using structure-aware protein embeddings generated by SaProt.\n2. Compared to other tools, PHIStruct exhibits the best balance of precision and recall across a range of confidence thresholds and sequence similarity settings."
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biorecap_0.1.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5        httr_1.4.7         cli_3.6.3          knitr_1.48         rlang_1.1.4       
 [6] xfun_0.46          ollamar_1.1.1      purrr_1.0.2        generics_0.1.3     jsonlite_1.8.8    
[11] glue_1.7.0         anytime_0.3.9      htmltools_0.5.8.1  tidyRSS_2.0.7      rappdirs_0.3.3    
[16] fansi_1.0.6        rmarkdown_2.27     evaluate_0.24.0    tibble_3.2.1       fastmap_1.2.0     
[21] tinytable_0.3.0.33 lifecycle_1.0.4    httr2_1.0.2        compiler_4.4.1     dplyr_1.1.4       
[26] Rcpp_1.0.13        pkgconfig_2.0.3    rstudioapi_0.16.0  digest_0.6.36      R6_2.5.1          
[31] tidyselect_1.2.1   utf8_1.2.4         curl_5.2.1         pillar_1.9.0       magrittr_2.0.3    
[36] withr_3.0.1        tools_4.4.1        xml2_1.3.6    

stephenturner commented 3 weeks ago

OK, never mind: I see the same thing when I upgrade to ollamar 1.2.1. I think the fix is easy here.
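
One way to make such a call tolerant of both return shapes is sketched below. This is not necessarily the change made in the package. `extract_response()` is a hypothetical helper that assumes only what this thread shows: ollamar 1.1.1 returned a table with a `response` column, while 1.2.1 returns a plain character vector for output = "text".

```r
# Hypothetical helper (not part of biorecap or ollamar): return the response
# text whether `res` is already a character vector or a list/data frame
# carrying a `response` element.
extract_response <- function(res) {
  if (is.character(res)) {
    return(res)
  }
  if (is.list(res) && !is.null(res[["response"]])) {
    return(as.character(res[["response"]]))
  }
  stop("Unexpected result type: ", paste(class(res), collapse = "/"))
}

# Usage with stand-in values (no ollama server needed):
extract_response("A two-sentence summary.")
extract_response(data.frame(response = "A two-sentence summary."))
```

Checking the shape of the result, rather than the installed package version, keeps the helper working on either side of the change.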

stephenturner commented 3 weeks ago

Thanks for the report @huyvuong. Fixed via #2. Reinstall and it should work. Let me know.

huyvuong commented 3 weeks ago

Thank you for the fix. It worked after reinstalling.