Closed sdgamboa closed 1 year ago
In the example below, I think the taxid 287 should be annotated with the taxon name 'Pseudomonas aeruginosa' and with the attribute values 'resistance to levofloxacin' (rarely) and 'sensitive to levofloxacin' (usually).
So what I propose is to leave the NCBI_ID empty and set the parent rank to species in code (obtaining it from the current Genome_ID).
library(bugphyzz)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ar <- as_tibble(physiologies('antimicrobial resistance')[[1]])
#> Finished antimicrobial resistance
## With strains
ar |>
  filter(NCBI_ID == '287') |>
  count(NCBI_ID, Attribute, Rank)
#> # A tibble: 2 × 4
#> NCBI_ID Attribute Rank n
#> <int> <chr> <chr> <int>
#> 1 287 resistance to levofloxacin species 5
#> 2 287 sensitive to levofloxacin species 15
## Removing strain names with a regex
ar |>
  filter(NCBI_ID == '287') |>
  mutate(Taxon_name = sub('^(\\w+ \\w+).+', '\\1', Taxon_name)) |>
  count(NCBI_ID, Taxon_name, Attribute)
#> # A tibble: 2 × 4
#> NCBI_ID Taxon_name Attribute n
#> <int> <chr> <chr> <int>
#> 1 287 Pseudomonas aeruginosa resistance to levofloxacin 5
#> 2 287 Pseudomonas aeruginosa sensitive to levofloxacin 15
## The species Pseudomonas aeruginosa is not annotated
ar |> filter(Taxon_name == 'Pseudomonas aeruginosa')
#> # A tibble: 0 × 14
#> # … with 14 variables: NCBI_ID <int>, Genome_ID <dbl>, Accession_ID <lgl>,
#> # Taxon_name <chr>, Attribute <chr>, Attribute_value <lgl>,
#> # Attribute_source <chr>, Evidence <chr>, Frequency <chr>, Rank <chr>,
#> # Parent_name <chr>, Parent_NCBI_ID <int>, Parent_rank <chr>,
#> # Confidence_in_curation <chr>
Created on 2022-09-09 with reprex v2.0.2
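For illustration only, here is a rough sketch of what the proposed change could look like on toy data. The strain names are made up, and the sketch assumes the species taxid currently stored in NCBI_ID is the one that should become the parent. The proposal above mentions deriving it from Genome_ID instead, which is not shown here. It also relabels Rank as 'strain', which is my assumption and not part of the proposal.

```r
# Toy rows (made-up strain names) mimicking the problem:
# strain-level taxon names annotated with the species taxid 287.
ar_toy <- data.frame(
  NCBI_ID = c(287L, 287L),
  Taxon_name = c("Pseudomonas aeruginosa strain A",
                 "Pseudomonas aeruginosa strain B"),
  Attribute = c("resistance to levofloxacin", "sensitive to levofloxacin"),
  Rank = c("species", "species"),
  stringsAsFactors = FALSE
)

# Extract the species binomial with the same regex used in the reprex.
species_name <- sub("^(\\w+ \\w+).+", "\\1", ar_toy$Taxon_name)

# Rows whose name carries more than the binomial are treated as strains.
is_strain <- species_name != ar_toy$Taxon_name

fixed <- ar_toy
# Assumption: the taxid currently in NCBI_ID belongs to the parent species.
fixed$Parent_name <- ifelse(is_strain, species_name, NA_character_)
fixed$Parent_NCBI_ID <- ifelse(is_strain, ar_toy$NCBI_ID, NA_integer_)
fixed$Parent_rank <- ifelse(is_strain, "species", NA_character_)
# Leave NCBI_ID empty for the strain rows, as proposed.
fixed$NCBI_ID[is_strain] <- NA_integer_
# Assumption (not stated in the proposal): relabel these rows as strains.
fixed$Rank[is_strain] <- "strain"

print(fixed)
```

With the parent columns filled in this way, the species-level annotation could later be derived from the strain rows instead of being attached to them directly.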
That makes sense to me. Thank you for catching this.
⚠️ This has been marked to be closed in 7 days.
This has been addressed in the following commits:
d27aa94. Add new spreadsheet. This has been documented in bugphyzzWrangling (script, data, and the file on Google Sheets).
4fc8eb2. Modify the physiologies function. If a dataset already has parent columns, no new parent columns are inserted when the data is imported. This also affects other datasets such as length, width, and habitat.
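The behavior described for the physiologies function can be sketched roughly as follows. This is not the actual bugphyzz code. The helper `add_parents_if_missing`, the `parent_lookup` table, and the join on NCBI_ID are all hypothetical and only illustrate the "skip if the parent columns already exist" idea. The lookup values are toy data.

```r
# Parent columns as they appear in the tibble printed in the reprex.
parent_cols <- c("Parent_name", "Parent_NCBI_ID", "Parent_rank")

# Hypothetical helper: add parent columns only when the dataset lacks them.
add_parents_if_missing <- function(dat, parent_lookup) {
  # A curated dataset that already has all parent columns is returned as is.
  if (all(parent_cols %in% colnames(dat))) {
    return(dat)
  }
  # Otherwise, fill parent columns from the lookup (assumed keyed by NCBI_ID).
  idx <- match(dat$NCBI_ID, parent_lookup$NCBI_ID)
  for (col in parent_cols) {
    dat[[col]] <- parent_lookup[[col]][idx]
  }
  dat
}

# Toy lookup table (illustrative values only).
lookup <- data.frame(
  NCBI_ID = 287L,
  Parent_name = "Pseudomonas",
  Parent_NCBI_ID = 286L,
  Parent_rank = "genus",
  stringsAsFactors = FALSE
)

# Dataset that already has parent columns: left unchanged.
curated <- data.frame(
  NCBI_ID = NA_integer_,
  Taxon_name = "Pseudomonas aeruginosa strain A",
  Parent_name = "Pseudomonas aeruginosa",
  Parent_NCBI_ID = 287L,
  Parent_rank = "species",
  stringsAsFactors = FALSE
)
curated_out <- add_parents_if_missing(curated, lookup)

# Dataset without parent columns: parents are filled from the lookup.
uncurated <- data.frame(
  NCBI_ID = 287L,
  Taxon_name = "Pseudomonas aeruginosa",
  stringsAsFactors = FALSE
)
uncurated_out <- add_parents_if_missing(uncurated, lookup)
```

The point of the early return is that manually curated parent information, such as the species-level parents in the new antimicrobial resistance spreadsheet, is not overwritten during import.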