rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Using reticulate for a shiny app for semantic search with BERT from SentenceTransformers: Error in py_call_impl(callable, dots$args, dots$keywords) : NameError: name 'faiss' is not defined #1469

Open alicesaunders opened 1 year ago

alicesaunders commented 1 year ago

When running the code below I am repeatedly getting the following error: Error in py_call_impl(callable, dots$args, dots$keywords) : NameError: name 'faiss' is not defined

I have two scripts, app.R and pythonSemanticSearch.py. Some lines are commented out: they are alternative functions I tried in order to see whether they fixed the error, but it has remained the same. My original code reads in an index that was generated and saved by another script. I have kept that code commented out and, for reproducibility, replaced it with a different dataset and an index generated within this code. The replacement dataset is used to generate this index and takes the place of the variable df in the example code below.

Here is the code from app.R:

 library(shiny)
  library(reticulate)

  use_python("my_env/Scripts/python.exe")

  sentence_transformers <- reticulate::import("sentence_transformers")
  SentenceTransformer <- sentence_transformers$SentenceTransformer

  for (i in 1:10) {
      gc(full = TRUE)
      system("nvidia-smi | grep MiB | grep Default")
      #model <- SentenceTransformer("trained_model_ALL_300")
      model <- SentenceTransformer('multi-qa-distilbert-cos-v1')
  }

  faiss <- reticulate::import("faiss")
  datasets <- reticulate::import(datasets) 
  load_dataset <- datasets$load_dataset
  ds = load_dataset('crime_and_punish', split='train[:100]')
  ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], return_tensors="pt"))[0][0].numpy()})
  ds_with_embeddings.add_faiss_index(column='embeddings')

  #faiss <- reticulate::import("faiss")
  #read_index <- faiss$read_index
  #index_path <- "trained_model_index_ALL_300.index"
  #index = read_index(index_path)

  #py_run_string("from sentence_transformers import SentenceTransformer")
  python <- import("pythonSemanticSearch") #import python script
  #python <- py_run_file("pythonSemanticSearch.py", local = TRUE)
  python$import_libraries() #import libraries from python function 

  #load df (from which the index was generated and the resulting dataframe needs to be based on)
  df = load_dataset('crime_and_punish')

  # Define UI for application 
  ui <- fluidPage(

      # Application title
      titlePanel("Semantic Search App"),

      # Sidebar with a slider input for number of bins 
      sidebarLayout(
          sidebarPanel(
              textInput(input = "query", "Enter your query:", ""),
              actionButton(input = "search", "Search")
          ),

          # Show a plot of the generated distribution
          mainPanel(
             tableOutput("results"),
             downloadButton(input = "downloadCSV", "Download CSV")
          )
      )
  )

  # Define server logic 
  server <- function(input, output) {

      # import the python module within the server logic 
      python <- import("pythonSemanticSearch")

      # import model 
      sentence_transformers <- reticulate::import("sentence_transformers")
      SentenceTransformer <- sentence_transformers$SentenceTransformer

      faiss <- reticulate::import("faiss")
      index_path <- "trained_model_index_ALL_300.index"
      index = faiss$read_index(index_path)

      for (i in 1:10) {
          gc(full = TRUE)
          system("nvidia-smi | grep MiB | grep Default")
          model <- SentenceTransformer("trained_model_ALL_300")
      }
      # attach to cpu 
      #model <- python$load_model(model)

      # import index 
      index <- python$load_index('trained_model_ALL_300.index')

      # check query not null and encode using BERT
      query_vector <- reactive({
          query <- input$query
          if (!is.null(query) && nchar(query)>0) {
             python$encode_query(query, model) # .py fcn 
          }
      })

      # generate search results 
      results <- eventReactive(input$search, {
          if (!is.null(query_vector())) {
              query_embedding <- query_vector()
              python$vector_search(input$query, query_embedding, model, index, df, num_results=10)
          }
      })

      # output table 
      output$results <- renderTable({
          results()
      })

      # download csv 
      output$downloadCSV <- downloadHandler(
          filename = function() {
              "semantic_search_results.csv"
          },
          content = function(file) {
              write.csv(results(), file)
          }
      )
  }

  # Run the app
  shinyApp(ui, server)

and here is the code from pythonSemanticSearch.py:

  # python script for the app - functions to read in the model

  def import_libraries():
      """
      Import the required libraries.
      """
      import pandas as pd
      import glob
      import numpy as np
      import torch
      import faiss
      from pathlib import Path
      import csv
      from sentence_transformers import SentenceTransformer
      from sentence_transformers import InputExample, losses, datasets
      from tqdm import tqdm

  def load_model(model_name):
      """
      Load a SentenceTransformer model.

      Args:
          model_name (str): Name of the SentenceTransformer model.

      Returns:
          model: Loaded SentenceTransformer model.
      """
      #model = SentenceTransformer(model_name)

      # Check if GPU/CPU is available and use it
      if torch.cuda.is_available():
          model = model.to(torch.device("cuda"))
      print(model.device)

      return model

  def load_index(index_path):
      """
      Load a FAISS index.

      Args:
          index_path (str): Path to the FAISS index file.

      Returns:
          index: Loaded FAISS index.
      """
      index = faiss.read_index(index_path)
      return index

  def encode_query(query, model):
      """
      Encode a query using a SentenceTransformer model.

      Args:
          query (str): User query that should be more than a sentence long.
          model: Sentence-transformers model.

      Returns:
          vector (numpy.array): Encoded vector of the query.
      """
      vector = model.encode([query])
      return vector

  def vector_search(query, vector, model, index, df, num_results=10):
      """
      Transform the search query to a vector using a BERT model and find similar vectors using FAISS.
      Create a pandas DataFrame with the search results.

      Args:
          query (str): User query that should be more than a sentence long.
          model_name (str): Name of the SentenceTransformer model.
          index_path (str): Path to the FAISS index file.
          df: DataFrame containing report information.
          num_results (int): Number of results to return.

      Returns:
          results_df: Pandas DataFrame containing the results.
      """

      D, I = index.search(np.array(vector).astype("float32"), k=num_results)

      def id2details(df, I, column):
          return [list(df[df.UniqueID == idx][column]) for idx in I[0]]

      title = id2details(df, I, 'docname')
      text = id2details(df, I, 'paratext')

      data = {
          'Title': [item[0] for item in title], 
          'Text': [item[0] for item in text],
          'Search query': query
      }

      results_df = pd.DataFrame(data)
      return results_df 

Here is the output from: reticulate::py_config()

reticulate::py_config()
python: C:/Users/Alice Saunders/Documents/Semantic Search Reticulate/my_env/Scripts/python.exe
libpython: C:/Users/Alice Saunders/AppData/Local/Programs/Python/Python311/python311.dll
pythonhome: C:/Users/Alice Saunders/Documents/Semantic Search Reticulate/my_env
virtualenv: C:/Users/Alice Saunders/Documents/Semantic Search Reticulate/my_env/Scripts/activate_this.py
version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Users/Alice Saunders/Documents/Semantic Search Reticulate/my_env/Lib/site-packages/numpy
numpy_version: 1.25.2
sentence_transformers: C:\Users\ALICES~1\DOCUME~1\SEMANT~1\my_env\Lib\site-packages\sentence_transformers

NOTE: Python version was forced by RETICULATE_PYTHON

Here is the output from utils::sessionInfo():

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] reticulate_1.24 shiny_1.6.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 rstudioapi_0.13 magrittr_2.0.1 rappdirs_0.3.3 xtable_1.8-4
[6] lattice_0.20-41 R6_2.5.0 rlang_0.4.10 fastmap_1.1.0 tools_4.0.4
[11] grid_4.0.4 png_0.1-7 jquerylib_0.1.3 withr_2.4.1 htmltools_0.5.1.1
[16] ellipsis_0.3.1 digest_0.6.27 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.2-18
[21] later_1.1.0.1 sass_0.4.1 promises_1.2.0.1 cachem_1.0.4 mime_0.10
[26] compiler_4.0.4 bslib_0.2.4 jsonlite_1.7.2 httpuv_1.5.5

The model and index used are files that I have already generated in a previous script. A base model from SentenceTransformers can be used instead e.g. model <- SentenceTransformer('multi-qa-distilbert-cos-v1').

t-kalinowski commented 1 year ago

Hi @alicesaunders,

Can you please try to make your example smaller and something I can run locally to reproduce the error?

At a quick glance, it looks like there is unmodified Python code in the R script (e.g., the use of . and { in ds.map, etc.). Also, the Python function import_libraries() seems to come from a misunderstanding of the difference in scoping rules between Python and R: an import inside a function does not make the package symbols globally available the way library() does in R.

(This issue thread is more for reporting bugs than for support).
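
For readers following along, the scoping point above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the issue: the standard-library modules fractions and math stand in for faiss and the other packages, and the function names load_index_broken and floor_sqrt are hypothetical. The idea is that an import executed inside a function binds the module name only in that function's local scope, so a sibling function that refers to the same name raises a NameError, whereas a module-level import is visible to every function in the file.

```python
# Hypothetical stand-in for pythonSemanticSearch.py. The standard-library
# module `fractions` plays the role of `faiss` so the sketch runs anywhere.
import math  # module-level import: visible to every function in this file


def import_libraries():
    """Mimics the pattern from the issue: the import happens inside a function."""
    import fractions  # binds `fractions` only in this function's local scope
    return fractions.Fraction(1, 2)


def load_index_broken(value):
    """Fails like load_index() in the issue: `fractions` is not a global name."""
    return fractions.Fraction(value)


def floor_sqrt(x):
    """Works, because `math` was imported at module level."""
    return math.floor(math.sqrt(x))


# Calling the helper succeeds, but it leaves no `fractions` name behind.
import_libraries()

try:
    load_index_broken(3)
    error_message = None
except NameError as exc:
    # Same kind of error as "name 'faiss' is not defined" in the report.
    error_message = str(exc)

print(error_message)
print(floor_sqrt(17))
```

Under that reading, moving the import statements to the top of the Python file, outside any function, would likely be the smallest change that makes names such as faiss, np, and pd available to load_index() and vector_search().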