taylo5jm / benchlingr

An unofficial R package for working with Benchling.
https://benchling-r.info/

Implement function to infer certain entity types using string matching #74

Open taylo5jm opened 1 year ago

taylo5jm commented 1 year ago

The function should take a character vector with 1 or more elements and try to infer the entity types of the elements using only string-matching based on the strings in the entity_lookup list below. The function should not use any component of the Benchling developer platform (the data warehouse or API).

Here is a starting point:

# The names of this list are the leading characters of the identifiers for the entity types we are interested in.
# Each element is a vector: the first element is a name for the entity type, the second is the API endpoint.
# Note: these are the endpoints for getting a *single* entity; downstream functions will use the *bulk* endpoints.
entity_lookup <- list(
    "plt_" = c("plate", "https://benchling.com/api/reference#/Plates/getPlate"),
    "box_" = c("box", "https://benchling.com/api/reference#/Boxes/getBox"),
    "con_" = c("container", "https://benchling.com/api/reference#/Containers/getContainer"),
    "loc_" = c("location", "https://benchling.com/api/reference#/Locations/getLocation"),
    "etr_" = c("entry", "https://benchling.com/api/reference#/Entries/getEntry"),
    "bfi_" = c("custom_entity", "https://benchling.com/api/reference#/Custom%20Entities/getCustomEntity"),
    "ent_" = c("user", "https://benchling.com/api/reference#/Users/getUser"),
    "sfs_" = c("dropdown", "https://benchling.com/api/reference#/Dropdowns/getDropdown"),
    "sfso_" = c("dropdown_option", "https://benchling.com/api/reference#/Dropdowns/getDropdown"), # the dropdown options are available from the `dropdown` endpoint, as well as the `dropdown_option` warehouse table. 
    "seq" = c("dna_sequence", NA), # both dna_oligo and dna_sequence types start with seq, so there isn't one endpoint. find these in the database in the `entity` table instead.
    "mxt"= c("mixture", "https://benchling.com/api/reference#/Mixtures/getMixture"),
    "container_batch" = c("container_content", "https://benchling.com/api/reference#/Containers/getContainerContent")
  )

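# use the entity_lookup to try to infer the entity types of 1 or more Benchling identifiers (entity_id)
infer_entity_type <- function(entity_id, entity_lookup) {
    prefixes <- names(entity_lookup)
    inferred <- vapply(entity_id, function(id) {
        # prefixes that this identifier starts with (which() drops NA comparisons)
        hit <- prefixes[which(startsWith(id, prefixes))]
        # no known prefix: NA; otherwise the first element of the entry, i.e. the entity type name
        if (length(hit) == 0) NA_character_ else entity_lookup[[hit[1]]][1]
    }, character(1), USE.NAMES = FALSE)
    # names are the original identifiers, values are either the type of entity or NA
    stats::setNames(inferred, entity_id)
}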

The output of infer_entity_type should be a named character vector, where the names are the identifiers and the values are the inferred entity types.

infer_entity_type(c("seq_Cuf0bmCm", "bfi_Q13AlXkf"), entity_lookup)
#   seq_Cuf0bmCm    bfi_Q13AlXkf
# "dna_sequence" "custom_entity"

When the prefix of an entity_id doesn't match any prefix we know about, return NA for that entity_id.

infer_entity_type(c("seq_Cuf0bmCm", "not_a_real_key"), entity_lookup)
#   seq_Cuf0bmCm not_a_real_key
# "dna_sequence"             NA

Note that infer_entity_type only uses string matching to make a best guess at the entity type. It doesn't guarantee that the identifier refers to an entity that actually exists; that check will happen in a different function. In the example below, the second entity ID isn't valid, but since it starts with seq_, the function will return "dna_sequence" for it.

infer_entity_type(c("seq_Cuf0bmCm", "seq_ZZZZZZZZ"), entity_lookup)
#   seq_Cuf0bmCm   seq_ZZZZZZZZ
# "dna_sequence" "dna_sequence"

The purpose of this function is to identify the items that are probably entities so that the queries to the API are as efficient as possible. The next function will take the IDs that likely correspond to known entity types and it will query the appropriate API endpoints to retrieve the information about the entities that do exist. I will provide more information on what that function should look like at a later date.
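
As a rough sketch of that hand-off (not a spec for the next function): drop the identifiers with no inferred type and group the rest by type, so that each group could later go to one bulk endpoint. The trimmed `lookup` vector and the `infer_types` helper are stand-ins invented for this example so it runs on its own; they are not part of the package.

```r
# Hypothetical, trimmed-down lookup: prefix -> entity type (assumption for this sketch).
lookup <- c("seq_" = "dna_sequence", "bfi_" = "custom_entity", "plt_" = "plate")

# Stand-in for infer_entity_type(): plain prefix matching, no API or warehouse calls.
infer_types <- function(entity_id, lookup) {
    types <- rep(NA_character_, length(entity_id))
    for (prefix in names(lookup)) {
        types[startsWith(entity_id, prefix)] <- lookup[[prefix]]
    }
    stats::setNames(types, entity_id)
}

ids <- c("seq_Cuf0bmCm", "bfi_Q13AlXkf", "bfi_VVamxrKQ", "RQDFLKJ")
inferred <- infer_types(ids, lookup)

# Keep only identifiers with an inferred type, then group them by type
# so each group can be handled together downstream.
known <- inferred[!is.na(inferred)]
ids_by_type <- split(names(known), unname(known))
print(ids_by_type)
```

Here `RQDFLKJ` is dropped before grouping, so nothing would be sent to the API for it.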

These are some test cases to start with:

# two valid identifiers of different schema types (DNA sequence and custom entity)
tc <- c("seq_Cuf0bmCm", "bfi_Q13AlXkf")

# one valid identifier of DNA sequence type
tc <- c("seq_Cuf0bmCm")

# one invalid identifier that looks like DNA sequence type
tc <- c("seq_Cuf0AAAA")

# two valid identifiers of the same schema type
tc <- c("bfi_Q13AlXkf", "bfi_VVamxrKQ")

# two valid identifiers of the same schema type and one invalid that looks like a different schema type
tc <- c("bfi_Q13AlXkf", "bfi_VVamxrKQ", "seq_Cuf0AAAA")

# one invalid identifier that doesn't follow a reasonable pattern
tc <- c("RQDFLKJ")
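
The expected results for these test cases could be written down as a reusable check, sketched below with base R only. `check_test_cases`, `stub_lookup` and `stub_infer` are hypothetical names made up for this example; the stub only knows the two prefixes used in the test cases and exists so the block runs on its own.

```r
# Expected results for the test cases above, as a reusable check.
# `infer_fn` is any function(entity_id) that returns a named character vector.
check_test_cases <- function(infer_fn) {
    stopifnot(
        identical(infer_fn(c("seq_Cuf0bmCm", "bfi_Q13AlXkf")),
                  c(seq_Cuf0bmCm = "dna_sequence", bfi_Q13AlXkf = "custom_entity")),
        identical(infer_fn("seq_Cuf0bmCm"), c(seq_Cuf0bmCm = "dna_sequence")),
        identical(infer_fn("seq_Cuf0AAAA"), c(seq_Cuf0AAAA = "dna_sequence")),
        identical(infer_fn(c("bfi_Q13AlXkf", "bfi_VVamxrKQ")),
                  c(bfi_Q13AlXkf = "custom_entity", bfi_VVamxrKQ = "custom_entity")),
        identical(infer_fn(c("bfi_Q13AlXkf", "bfi_VVamxrKQ", "seq_Cuf0AAAA")),
                  c(bfi_Q13AlXkf = "custom_entity", bfi_VVamxrKQ = "custom_entity",
                    seq_Cuf0AAAA = "dna_sequence")),
        identical(infer_fn("RQDFLKJ"), c(RQDFLKJ = NA_character_))
    )
    invisible(TRUE)
}

# Hypothetical stand-in implementation (assumption): looks up the text up to
# and including the first "_" in a two-entry prefix table.
stub_lookup <- c("seq_" = "dna_sequence", "bfi_" = "custom_entity")
stub_infer <- function(entity_id) {
    prefix <- sub("^([^_]*_).*$", "\\1", entity_id)
    stats::setNames(unname(stub_lookup[prefix]), entity_id)
}

check_test_cases(stub_infer)
```

To run the same checks against the real function, something like `check_test_cases(function(ids) infer_entity_type(ids, entity_lookup))` should work, assuming it returns an unclassed named character vector.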
taylo5jm commented 1 year ago

This issue is necessary for #62