waldronlab / bugphyzz

Harmonized annotation of microbial physiology
http://waldronlab.io/bugphyzz/

Merge PATRIC_IDs into Taxon_name and NCBI_ID? #193

Closed sdgamboa closed 1 year ago

sdgamboa commented 1 year ago

@lwaldron, @kbeckenrode, some PATRIC_IDs belong to the same taxon. Should the annotations of these PATRIC_IDs be merged into a single entry (row)? The PATRIC_ID field/cell could then be "1733.103, 1733.223". Note that we already use this format in the Accession_ID column of some datasets.

Example from the PATRIC website and from bugphyzz (for some reason 1733.105 is missing from bugphyzz, possibly due to updates in PATRIC):

suppressMessages({
    library(bugphyzz)
    library(dplyr)
})

ar <- as_tibble(physiologies('antimicrobial resistance')[[1]])
#> Finished antimicrobial resistance
patric_ids <- c(
    '1733.103', '1733.223', '1733.105'
)

ar |> 
    filter(PATRIC_ID %in% patric_ids) |> 
    mutate(PATRIC_ID = as.character(PATRIC_ID)) |> 
    select(NCBI_ID, Taxon_name, PATRIC_ID) |> 
    distinct()
#> # A tibble: 2 × 3
#>   NCBI_ID Taxon_name                        PATRIC_ID
#>   <chr>   <chr>                             <chr>    
#> 1 unknown Mycobacterium tuberculosis G04046 1733.103 
#> 2 unknown Mycobacterium tuberculosis G04046 1733.223

Created on 2022-12-06 with reprex v2.0.2
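
For illustration, the merge proposed above could look roughly like the sketch below. It is not the bugphyzz implementation: `ar_toy` and `merged` are hypothetical names, the table is a toy stand-in built from the two rows shown above, and the "Hypothetical taxon" row with ID "0000.1" is made up. It collapses PATRIC_IDs that share the same NCBI_ID and Taxon_name into one comma-separated cell, as suggested for the PATRIC_ID field.

```r
library(dplyr)

# Toy stand-in for the antimicrobial resistance table (assumption: the real
# table has at least these three columns). The third row is hypothetical.
ar_toy <- tibble::tibble(
    NCBI_ID = c("unknown", "unknown", "unknown"),
    Taxon_name = c(
        "Mycobacterium tuberculosis G04046",
        "Mycobacterium tuberculosis G04046",
        "Hypothetical taxon"
    ),
    PATRIC_ID = c("1733.103", "1733.223", "0000.1")
)

# One row per taxon; PATRIC_IDs of the same taxon are joined with ", ".
merged <- ar_toy |>
    group_by(NCBI_ID, Taxon_name) |>
    summarise(
        PATRIC_ID = paste(unique(PATRIC_ID), collapse = ", "),
        .groups = "drop"
    )

merged
```

In this sketch the two Mycobacterium tuberculosis G04046 rows end up in a single row whose PATRIC_ID is "1733.103, 1733.223". Any other annotation columns would need their own rule for how to combine values, which this example does not address.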

sdgamboa commented 1 year ago

Note that keeping these PATRIC_IDs as separate rows leads to duplicated annotations:

suppressMessages({
    library(bugphyzz)
    library(dplyr)
})

ar <- as_tibble(physiologies('antimicrobial resistance')[[1]])
#> Finished antimicrobial resistance
dup_rows <- which(duplicated(ar[,c('NCBI_ID', 'Taxon_name', 'Attribute')]))
length(dup_rows)
#> [1] 87

Created on 2022-12-06 with reprex v2.0.2
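
To see which entries are duplicated rather than only how many, something like the following sketch might help. It uses a made-up table (`ann_toy`) with the same three key columns as the check above; the attribute value and the "Hypothetical taxon" row are placeholders, not real bugphyzz data.

```r
library(dplyr)

# Toy table with the key columns used in the duplicate check above.
# All attribute values are hypothetical placeholders.
ann_toy <- tibble::tibble(
    NCBI_ID = c("unknown", "unknown", "unknown"),
    Taxon_name = c(
        "Mycobacterium tuberculosis G04046",
        "Mycobacterium tuberculosis G04046",
        "Hypothetical taxon"
    ),
    Attribute = c("hypothetical drug A", "hypothetical drug A", "hypothetical drug A"),
    PATRIC_ID = c("1733.103", "1733.223", "0000.1")
)

# Count rows per key combination and keep only combinations seen more than once.
dup_keys <- ann_toy |>
    count(NCBI_ID, Taxon_name, Attribute, name = "n_rows") |>
    filter(n_rows > 1)

dup_keys
```

Each row of `dup_keys` is one NCBI_ID/Taxon_name/Attribute combination that appears more than once, together with its row count.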

sdgamboa commented 1 year ago

All IDs are merged into the NCBI IDs, which are the identifiers used by our data structure (the NCBI tree).