The function should take a character vector with 1 or more elements and try to infer the entity types of the elements using only string-matching based on the strings in the entity_lookup list below. The function should not use any component of the Benchling developer platform (the data warehouse or API).
Here is some pseudocode:
# the names of this list are the first characters in the identifiers for the entities we are interested in.
# each element is a vector, the first element being a name for the entity type, the second is the API endpoint.
# note: I have listed the API endpoints to get a *single* entity, but the *bulk* endpoints will be used in downstream functions.
entity_lookup <- list(
  "plt_" = c("plate", "https://benchling.com/api/reference#/Plates/getPlate"),
  "box_" = c("box", "https://benchling.com/api/reference#/Boxes/getBox"),
  "con_" = c("container", "https://benchling.com/api/reference#/Containers/getContainer"),
  "loc_" = c("location", "https://benchling.com/api/reference#/Locations/getLocation"),
  "etr_" = c("entry", "https://benchling.com/api/reference#/Entries/getEntry"),
  "bfi_" = c("custom_entity", "https://benchling.com/api/reference#/Custom%20Entities/getCustomEntity"),
  "ent_" = c("user", "https://benchling.com/api/reference#/Users/getUser"),
  "sfs_" = c("dropdown", "https://benchling.com/api/reference#/Dropdowns/getDropdown"),
  # the dropdown options are available from the `dropdown` endpoint, as well as the `dropdown_option` warehouse table.
  "sfso_" = c("dropdown_option", "https://benchling.com/api/reference#/Dropdowns/getDropdown"),
  # both dna_oligo and dna_sequence types start with seq, so there isn't one endpoint. find these in the database in the `entity` table instead.
  "seq" = c("dna_sequence", NA),
  "mxt" = c("mixture", "https://benchling.com/api/reference#/Mixtures/getMixture"),
  "container_batch" = c("container_content", "https://benchling.com/api/reference#/Containers/getContainerContent")
)
# use the entity_lookup to try to infer the schemas of 1 or more Benchling identifiers (entity_id)
infer_entity_type <- function(entity_id, entity_lookup) {
  # split each identifier into its "_"-separated tags
  entity_tags <- strsplit(entity_id, split = "_")
  # see if any of the entity_tags match the names of entity_lookup
  # (most names in entity_lookup keep the trailing "_", so account for it when comparing)
  # return a vector with the same length as entity_id, where the names of the vector
  # are the original identifiers and the values are either the type of entity or NA
}
The output of infer_entity_type should be a named character vector, where the names are the identifiers and the values are the schema types.
When the entity_id prefix can't be matched to any prefix we are aware of, return NA for that entity_id.
infer_entity_type(c("seq_Cuf0bmCm", "not_a_real_key"), entity_lookup)
#   seq_Cuf0bmCm not_a_real_key
# "dna_sequence"             NA
Note that the infer_entity_type function only uses string matching to make a best guess at the entity type. It doesn't guarantee that the identifier corresponds to a valid entity; that check will happen in a different function. For example, seq_Cuf0AAAA (used in the test cases) isn't a valid entity ID, but since it starts with seq_, the function will return "dna_sequence" for it.
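As a rough sketch of the behaviour described above (one possible approach, not the required implementation): it matches on prefixes with startsWith() instead of the strsplit() call in the pseudocode, since the names in entity_lookup mix prefixes with and without a trailing underscore. The entity_lookup here is only a four-entry subset, repeated so the snippet runs on its own, and trying longer prefixes first is a defensive assumption rather than something the spec asks for.

```r
# Subset of the full entity_lookup list, repeated so this sketch is self-contained.
entity_lookup <- list(
  "bfi_" = c("custom_entity", "https://benchling.com/api/reference#/Custom%20Entities/getCustomEntity"),
  "sfs_" = c("dropdown", "https://benchling.com/api/reference#/Dropdowns/getDropdown"),
  "sfso_" = c("dropdown_option", "https://benchling.com/api/reference#/Dropdowns/getDropdown"),
  "seq" = c("dna_sequence", NA)
)

infer_entity_type <- function(entity_id, entity_lookup) {
  stopifnot(is.character(entity_id), length(entity_id) >= 1)

  # The first element of each lookup entry is the entity type name.
  types <- vapply(entity_lookup, function(x) x[[1]], character(1))

  # Assumption: try longer prefixes first so the most specific prefix wins.
  prefixes <- names(types)[order(nchar(names(types)), decreasing = TRUE)]

  out <- rep(NA_character_, length(entity_id))
  for (p in prefixes) {
    # Only fill positions that are still unmatched; NA inputs stay NA.
    hit <- is.na(out) & !is.na(entity_id) & startsWith(entity_id, p)
    out[hit] <- types[[p]]
  }

  names(out) <- entity_id
  out
}

result <- infer_entity_type(c("seq_Cuf0bmCm", "seq_Cuf0AAAA", "RQDFLKJ"), entity_lookup)
print(result)
# seq_Cuf0bmCm -> "dna_sequence", seq_Cuf0AAAA -> "dna_sequence", RQDFLKJ -> NA
```

Both seq_ identifiers come back as "dna_sequence" even though only the first is a real entity, which is the intended best-guess behaviour.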
The purpose of this function is to identify the items that are probably entities so that the queries to the API are as efficient as possible. The next function will take the IDs that likely correspond to known entity types and it will query the appropriate API endpoints to retrieve the information about the entities that do exist. I will provide more information on what that function should look like at a later date.
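To illustrate how the output could feed the next step, here is a small sketch that groups the likely-entity IDs by inferred type and drops the unmatched ones, so each group could later go to one bulk endpoint. The helper name group_ids_by_type is hypothetical, the input vector is hard-coded from the examples in this description, and no API call is made here.

```r
# Example of what infer_entity_type() is expected to return (hard-coded here).
inferred <- c(
  seq_Cuf0bmCm = "dna_sequence",
  bfi_Q13AlXkf = "custom_entity",
  bfi_VVamxrKQ = "custom_entity",
  RQDFLKJ = NA
)

# Hypothetical helper: keep only IDs with a known type and group them by type.
group_ids_by_type <- function(inferred) {
  known <- inferred[!is.na(inferred)]
  split(names(known), unname(known))
}

ids_by_type <- group_ids_by_type(inferred)
str(ids_by_type)
```

Each element of ids_by_type is a character vector of IDs sharing one inferred type, so one bulk query per element would be enough.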
These are some test cases to start with:
# two valid identifiers of different schema types (DNA sequence and custom entity)
tc <- c("seq_Cuf0bmCm", "bfi_Q13AlXkf")
# one valid identifier of DNA sequence type
tc <- c("seq_Cuf0bmCm")
# one invalid identifier that looks like DNA sequence type
tc <- c("seq_Cuf0AAAA")
# two valid identifiers of the same schema type
tc <- c("bfi_Q13AlXkf", "bfi_VVamxrKQ")
# two valid identifiers of the same schema type and one invalid that looks like a different schema type
tc <- c("bfi_Q13AlXkf", "bfi_VVamxrKQ", "seq_Cuf0AAAA")
# one invalid identifier that doesn't follow a reasonable pattern
tc <- c("RQDFLKJ")
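The test cases above can also be written with their expected results, which follow directly from entity_lookup (seq maps to "dna_sequence", bfi_ to "custom_entity", anything unmatched to NA). The harness below is only a sketch: check_test_cases is a hypothetical helper, and infer_fn stands for whatever implementation of infer_entity_type is being checked.

```r
# Expected named vectors for the six test cases; the names are the input IDs.
expected <- list(
  c(seq_Cuf0bmCm = "dna_sequence", bfi_Q13AlXkf = "custom_entity"),
  c(seq_Cuf0bmCm = "dna_sequence"),
  c(seq_Cuf0AAAA = "dna_sequence"),
  c(bfi_Q13AlXkf = "custom_entity", bfi_VVamxrKQ = "custom_entity"),
  c(bfi_Q13AlXkf = "custom_entity", bfi_VVamxrKQ = "custom_entity", seq_Cuf0AAAA = "dna_sequence"),
  c(RQDFLKJ = NA_character_)
)

# Hypothetical harness: infer_fn must have the signature
# infer_fn(entity_id, entity_lookup). Returns one TRUE/FALSE per test case.
check_test_cases <- function(infer_fn, entity_lookup) {
  vapply(expected, function(want) {
    got <- infer_fn(names(want), entity_lookup)
    identical(got, want)
  }, logical(1))
}

# Usage, once an implementation exists:
# all(check_test_cases(infer_entity_type, entity_lookup))
```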