automatically pick a result when multiple hits returned? classification()

toczydlowski commented 2 years ago

I am trying to run classification() in a loop on a computing cluster to look up lineage info for a large list of species-ish names. Right now I am getting an error on the cluster that, based on all of my debugging, comes from the entries that return multiple hits and then pop up a prompt asking for the number of the hit you want to select. Is it possible to have classification() automatically pick the first hit when multiple hits are found? Or do you have another suggestion for bypassing this issue when running non-interactively on a cluster? Thanks!

sckott commented 2 years ago

@zachary-foster maybe i'm not remembering but I think rows is what you want https://docs.ropensci.org/taxize/reference/classification.html#arguments - and then you can programmatically decide what to do with results

zachary-foster commented 2 years ago

Yea, rows will allow you to select which result before the query. For example:

classification('Asterina', db = 'ncbi')            # prompts when multiple hits are found
classification('Asterina', db = 'ncbi', rows = 1)  # takes the first hit, no prompt

Although, you might be looking for a fungus and get a starfish : )

toczydlowski commented 2 years ago

ah, thanks! yes this does what I want it to. I think for now I'll go with rows = 1, knowing I might be getting some weird results by just blindly always picking first row. there will be more hands-on QC in this case downstream so I think this fix will work. thanks team!
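
For a batch run like the one described here, a minimal sketch of the loop might look like the following. It assumes rows = 1 is acceptable for every name and wraps each lookup in tryCatch so one failed query does not stop the whole job. The helper classify_first_hit and its lookup argument are hypothetical and not part of taxize; lookup only exists so the function can be exercised without network access.

```r
# Sketch: look up lineages for many names without interactive prompts.
# `classify_first_hit` and `lookup` are hypothetical, not taxize functions;
# by default `lookup` simply forwards to taxize::classification().
classify_first_hit <- function(taxa, db = "ncbi",
                               lookup = function(x, ...) taxize::classification(x, ...)) {
  results <- lapply(taxa, function(taxon) {
    tryCatch(
      # rows = 1 takes the first hit instead of prompting for a choice
      lookup(taxon, db = db, rows = 1)[[1]],
      error = function(e) {
        # Record the failure and carry on with the next name
        message("Lookup failed for '", taxon, "': ", conditionMessage(e))
        NA
      }
    )
  })
  names(results) <- taxa
  results
}

# Example usage (needs the taxize package and network access):
# lineages <- classify_first_hit(c("Asterina", "Quercus"))
```

Names that come back as NA can then be set aside for the hands-on QC step.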

salix-d commented 2 years ago

I don't know why these filters aren't integrated, but to get the genus from the right division you can do:

taxize::classification(taxize::get_uid('Asterina', division_filter = "ascomycete fungi")[1], "ncbi")
taxize::classification(taxize::get_uid('Asterina', division_filter = "starfish")[1], "ncbi")

There are similar functions to get IDs for each database. The filters vary by function (different APIs).
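
Resolving IDs as a separate step also gives another non-interactive option: get_uid() accepts ask = FALSE, which is documented to return NA instead of prompting when a name has multiple matches. A rough sketch of splitting names into resolved and unresolved sets is below. The helper split_by_resolution and its resolver argument are hypothetical; note that names with no match at all also come back as NA, so the unresolved set mixes both cases.

```r
# Sketch: resolve IDs first, then separate names that need manual review.
# `split_by_resolution` and `resolver` are hypothetical, not taxize functions;
# by default `resolver` calls taxize::get_uid() with ask = FALSE.
split_by_resolution <- function(taxa,
                                resolver = function(x) {
                                  taxize::get_uid(x, ask = FALSE, messages = FALSE)
                                }) {
  ids <- as.character(resolver(taxa))
  found <- !is.na(ids)
  list(
    # Named character vector: taxon name -> ID
    resolved = stats::setNames(ids[found], taxa[found]),
    # Ambiguous or not-found names, to be checked by hand
    unresolved = taxa[!found]
  )
}

# Example usage (needs the taxize package and network access):
# split <- split_by_resolution(c("Asterina", "Quercus"))
# lineages <- taxize::classification(split$resolved, db = "ncbi")
```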

zachary-foster commented 2 years ago

@salix-d Good observation. I will try to look into making that an option for taxize::classification

salix-d commented 1 year ago

For some reason in the classification function, ncbi is the only one to not have ... as an argument in id <- process_ids(sci_id, db, get_uid, rows = rows). Just by adding that, we could then use division_filter from classification.

The argument's name changes between dbs, though, and I don't know that all dbs have that option (I know itis doesn't). For bold you can use division and rank. For gbif you can use kingdom, phylum, ..., genus; it still returns more than one result, but it does make sure that the one you want is at the top of the list, so you can feel more confident using rows = 1.
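
To illustrate the ... suggestion, here is a toy sketch in plain R. None of these functions are taxize code; resolve_id, classify_without_dots and classify_with_dots are made-up stand-ins that only show why a wrapper has to forward ... before a filter such as division_filter can reach the ID lookup.

```r
# Toy stand-in for an ID lookup that understands a filter argument.
resolve_id <- function(name, rows = NA, division_filter = NULL) {
  if (is.null(division_filter)) name else paste0(name, " [", division_filter, "]")
}

# Wrapper that does NOT forward `...`: extra arguments are rejected.
classify_without_dots <- function(name, rows = NA) {
  resolve_id(name, rows = rows)
}

# Wrapper that forwards `...`: the filter reaches resolve_id().
classify_with_dots <- function(name, rows = NA, ...) {
  resolve_id(name, rows = rows, ...)
}

classify_with_dots("Asterina", division_filter = "starfish")
# classify_without_dots("Asterina", division_filter = "starfish")
# would stop with an "unused argument" error.
```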

zachary-foster commented 1 year ago

Yea, I have been wanting to go through and make them all more consistent. Ideally even combine all the get_* functions into a single get_id_from_name function or something like that. In the meantime I added the ... to classification for NCBI like you suggested:

library(taxize)
taxize::classification('Asterina', division_filter = "ascomycete fungi", db = "ncbi")
#> ══  1 queries  ═══════════════
#> 
#> Retrieving data for taxon 'Asterina'
#> ✔  Found:  Asterina
#> ══  Results  ═════════════════
#> 
#> • Total: 1 
#> • Found: 1 
#> • Not Found: 0
#> $Asterina
#>                              name         rank      id
#> 1              cellular organisms      no rank  131567
#> 2                       Eukaryota superkingdom    2759
#> 3                    Opisthokonta        clade   33154
#> 4                           Fungi      kingdom    4751
#> 5                         Dikarya   subkingdom  451864
#> 6                      Ascomycota       phylum    4890
#> 7                  saccharomyceta        clade  716545
#> 8                  Pezizomycotina    subphylum  147538
#> 9                    leotiomyceta        clade  716546
#> 10                 dothideomyceta        clade  715962
#> 11                Dothideomycetes        class  147541
#> 12 Dothideomycetes incertae sedis      no rank  159987
#> 13                    Asterinales        order 1619909
#> 14                   Asterinaceae       family  281108
#> 15                       Asterina        genus  859380
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"

Created on 2023-03-09 with reprex v2.0.2

ropensci / taxize

automatically pick a result when multiple hits returned? classification() #890