ropensci / EDIutils

An API Client for the Environmental Data Initiative Repository
https://docs.ropensci.org/EDIutils/

Draft `make_query` Function for R-Style Solr Queries #58

Open njlyon0 opened 7 months ago

njlyon0 commented 7 months ago

Summary

Hey EDIutils team! I had a conversation with Colin Smith and Greg Maurer recently about creating a make_query function to help make Solr queries for people with some R literacy but limited prior exposure to Solr. The hope is that this new function would make it easier for R users to make good use of EDIutils::search_data_packages.

I've taken a stab at this function and will attach the full code to this issue. Note that I also wrote two helper functions, solr_wild and solrize, to keep the internal components of make_query as streamlined as possible. I'm definitely a novice with Solr queries, so make_query may be missing crucial arguments, but I think it's a reasonable starting point: it is built to be semi-modular and could easily support additional arguments. All functions are written in base R (version 4.3.2).

Let me know if this doesn't work on your end and/or if you'd like me to make any changes before it could possibly be built into EDIutils. Thanks!

Function Demo Script

# Load needed libraries
library(EDIutils)

# Clear environment
rm(list = ls())

# Define helper function
## Swaps human equivalents of wildcards for Solr wildcard
solr_wild <- function(bit){

  # Handle empty `bit`
  if(is.null(bit) == TRUE){

    # Replace with wildcard
    bit_v2 <- "*"
  }

  # Handle English equivalents for wildcard
  else if(length(bit) == 1){

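    # Replace allowed keywords with wildcard
    ## Anchored so only a whole value of "all" or "any" is swapped
    ## (otherwise "rainfall" would become "rainf*")
    bit_v2 <- gsub(pattern = "^(all|any)$", replacement = "*", x = bit)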
  } 

  # If neither condition is met, return whatever was originally supplied
  else { bit_v2 <- bit }

  # Return finished product
  return(bit_v2) }

# Example(s)
solr_wild(bit = NULL)
solr_wild(bit = "any")
solr_wild(bit = "something else")

# Define helper function
## Parses English text into Solr syntax (i.e., right delimiters, etc.)
solrize <- function(bit){

  # Replace spaces with hyphens
  bit_v2 <- gsub(pattern = " ", replacement = "-", x = bit)

  # If more than one value, handle that
  if(length(bit_v2) > 1){

    # Collapse with plus signs
    bit_v3 <- paste0("(", paste0(bit_v2, collapse = "+"), ")")

  } else { bit_v3 <- bit_v2 }

  # Return finished bit
  return(bit_v3) }

# Example(s)
solrize(bit = c("primary production", "plants"))

# Define function to generate query
make_query <- function(keywords = NULL, subjects = NULL, authors = NULL, 
                       scopes = NULL, excl_scopes = NULL, 
                       return_fields = "all", limit = 10){

  ## Error Checking ----
  # Define supported return 'return_fields'
  good_fields <- c("*", "all", "abstract", "begindate", "doi", "enddate", "funding", "geographicdescription", "id", "methods", "packageid", "pubdate", "responsibleParties", "scope", "site", "taxonomic", "title", "authors", "spatialCoverage", "sources", "keywords", "organizations", "singledates", "timescales")

  # Error out for unsupported ones
  if(all(return_fields %in% good_fields) != TRUE)
    stop("Unrecognized return field(s): ", 
         paste(base::setdiff(x = return_fields, y = good_fields), collapse = "; "))

  # Fall back to the default (with a message) for non-numeric limit
  if(is.numeric(limit) != TRUE){
    message("`limit` must be numeric, coercing to 10")
    limit <- 10 }

  ## Solr Query Construction ----
  # Make start of query object
  query_v0 <- "q="

  # Handle keywords (a `NULL` value becomes the Solr wildcard)
  ### 1. Turn into Solr Syntax
  solr_kw <- solrize(bit = solr_wild(bit = keywords)) 

  ### 2. Add to query
  query_v1 <- paste0(query_v0, "keyword:", solr_kw)

  # Handle authors
  solr_aut <- solrize(bit = solr_wild(bit = authors))
  query_v2 <- paste0(query_v1, "&fq=", "author:", solr_aut)

  # Handle subjects
  solr_sub <- solrize(bit = solr_wild(bit = subjects))
  query_v3 <- paste0(query_v2, "&fq=", "subject:", solr_sub)

  # Handle scopes
  solr_scp <- solrize(bit = solr_wild(bit = scopes))
  query_v4 <- paste0(query_v3, "&fq=", "scope:", solr_scp)

  # EXCLUDED scopes
  ## Handled differently because don't want to swap `NULL` for wildcard
  if(is.null(excl_scopes) != TRUE){

    # Solr-ize
    solr_excl_scp <- solrize(bit = excl_scopes)

    # Add to query
    query_v5 <- paste0(query_v4, "&fq=", "-scope:", solr_excl_scp)

    # Or skip
  } else { query_v5 <- query_v4 }

  # Parse return fields
  ## Solr syntax for multiple entries differs here from other elements of query
  solr_fl <- paste(solr_wild(bit = return_fields), collapse=",")
  query_v6 <- paste0(query_v5, "&fl=", solr_fl)

  # Finally, assemble full query with row limit
  solr_query <- paste0(query_v6, "&rows=", limit)

  # Return that to the user
  return(solr_query) }

#  Invoke function
( request <- make_query(keywords = "*", 
                        scopes = "knb-lter-fce",
                        excl_scopes = c("ecotrends", "lter landsat"),
                        return_fields =  c("title", "authors", "id", "doi"),
                        limit = 10) )

# Test assembled query
EDIutils::search_data_packages(query = request)

# Test use of `make_query` inside of `search_data_packages`
EDIutils::search_data_packages(query = make_query(excl_scopes = "knb-lter-fce",
                                                  return_fields = c("title", "id")))

njlyon0 commented 7 months ago

Related Function

I just heard about the query function in the dataone package, which seems like it could be a nice 'middle path' for constructing Solr queries.

Users can create their own Solr queries (A) by hand/manually, (B) by supplying a named list that breaks queries into four chunks, or (C) by using something like the function I supplied above where each Solr parameter is mapped to a separate argument.
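As a rough sketch of how options (A) and (B) relate, the snippet below writes one query by hand and then builds the same string from a named list. Note that `list_to_query` is a hypothetical helper written only for this illustration (it is not part of EDIutils or dataone), and the choice of `q`, `fq`, `fl`, and `rows` as the four chunks is my assumption.

```r
# (A) Hand-written Solr query string
query_manual <- "q=keyword:*&fq=scope:knb-lter-fce&fl=title,id&rows=10"

# (B) Named list of Solr parameters, collapsed into a single string
## `list_to_query` is a hypothetical helper (not in EDIutils or dataone)
list_to_query <- function(params){

  # Pair each name with its value, then join the pairs with ampersands
  paste(paste0(names(params), "=", unlist(params)), collapse = "&") }

# Build the same query from its four chunks
query_list <- list_to_query(params = list(q = "keyword:*",
                                          fq = "scope:knb-lter-fce",
                                          fl = "title,id",
                                          rows = 10))

# Check that both approaches yield the same query
identical(query_manual, query_list)
```

Option (C) goes one step further than (B) by giving each Solr parameter its own R argument, so users don't need to know the `q` / `fq` / `fl` names at all.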

I'm biased, but I think mapping each parameter to its own argument is novel enough (relative to dataone::query) that it still warrants inclusion as its own function. I did want to point out, though, that a similar function already exists.

clnsmth commented 7 months ago

This is great @njlyon0, thanks for the draft! I'll give it a test drive and return with some feedback.

clnsmth commented 7 months ago

Related to #36