Open toczydlowski opened 2 years ago
@zachary-foster maybe I'm not remembering correctly, but I think rows is what you want (https://docs.ropensci.org/taxize/reference/classification.html#arguments), and then you can programmatically decide what to do with the results
Yea, rows will allow you to select which result to use before the query. For example:
classification('Asterina', db = 'ncbi')
classification('Asterina', db = 'ncbi', rows = 1)
Although, you might be looking for a fungus and get a starfish : )
ah, thanks! yes this does what I want it to. I think for now I'll go with rows = 1, knowing I might be getting some weird results by just blindly always picking first row. there will be more hands-on QC in this case downstream so I think this fix will work. thanks team!
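For running this unattended over a long list of names, a minimal sketch along these lines might help. The helper name classify_all and its lookup argument are hypothetical (not part of taxize); the only taxize call assumed is classification() with rows = 1, as shown above. Errors for individual names are caught so one bad entry does not stop the whole loop.

```r
# Sketch: classify many names non-interactively by always taking the first hit.
# `classify_all` and its `lookup` argument are hypothetical names for this example.
classify_all <- function(species_names, db = "ncbi",
                         lookup = taxize::classification) {
  results <- lapply(species_names, function(nm) {
    tryCatch(
      # rows = 1 selects the first match up front, so no prompt is shown
      lookup(nm, db = db, rows = 1)[[1]],
      # a failed lookup becomes NA instead of stopping the whole loop
      error = function(e) NA
    )
  })
  names(results) <- species_names
  results
}

# Usage (needs taxize installed and network access):
# lineages <- classify_all(c("Asterina", "Quercus robur"))
# Names that did not return a lineage table, for later hands-on QC:
# needs_qc <- names(lineages)[!sapply(lineages, is.data.frame)]
```

Passing the lookup function as an argument is only a convenience for trying the loop with a stand-in function before sending real queries from the cluster.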
I don't know why these filters aren't integrated, but to get the genus from the right division you can do:
taxize::classification(taxize:::get_uid('Asterina', division_filter = "ascomycete fungi")[1], "ncbi")
taxize::classification(taxize:::get_uid('Asterina', division_filter = "starfish")[1], "ncbi")
There are similar functions to get IDs for each database. The filters vary by function (different APIs).
@salix-d Good observation. I will try to look into making that an option for taxize::classification
For some reason, in the classification function, ncbi is the only one to not have ... as an argument in id <- process_ids(sci_id, db, get_uid, rows = rows). Just by adding that, we could then use division_filter from classification.
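To illustrate why forwarding ... matters, here is a toy sketch. It does not use taxize's internals: toy_get_id, classify_without_dots, classify_with_dots, and the IDs are all made up for the example, and the order of the two hits is an assumption.

```r
# Toy stand-in for an ID lookup such as get_uid(); names and IDs are made up.
toy_get_id <- function(name, rows = NA, division_filter = NULL) {
  hits <- data.frame(
    id = c("toy-starfish", "toy-fungus"),
    division = c("starfish", "ascomycete fungi"),
    stringsAsFactors = FALSE
  )
  # Keep only hits from the requested division, if one was given
  if (!is.null(division_filter)) {
    hits <- hits[hits$division == division_filter, ]
  }
  # Keep only the requested row(s), if any were given
  if (!is.na(rows)) {
    hits <- hits[rows, ]
  }
  hits$id
}

# Accepts `...` but does not pass it on: the filter is silently dropped.
classify_without_dots <- function(name, rows = NA, ...) {
  toy_get_id(name, rows = rows)
}

# Forwards `...`: division_filter now reaches the lookup.
classify_with_dots <- function(name, rows = NA, ...) {
  toy_get_id(name, rows = rows, ...)
}

classify_without_dots("Asterina", rows = 1, division_filter = "ascomycete fungi")
#> [1] "toy-starfish"
classify_with_dots("Asterina", rows = 1, division_filter = "ascomycete fungi")
#> [1] "toy-fungus"
```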
Although the argument's name changes between databases, and I don't know that all databases have that option (I know itis doesn't).
For bold you can use division and rank.
For gbif you can use kingdom, phylum, ..., genus. It still returns more than one result, but it does make sure that the one you want is at the top of the list, so you can feel more confident using rows = 1.
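Since the filter argument differs per database, one way to keep a script tidy is a small lookup table of arguments. This is only a sketch: fungus_filters and id_lookup_args are hypothetical names, the filter values are assumptions, and the commented calls assume get_uid() and get_gbifid() accept these arguments along with rows.

```r
# Hypothetical table of filter arguments for finding the fungal genus Asterina.
# Argument names follow the comments above; the values are assumptions.
fungus_filters <- list(
  ncbi = list(division_filter = "ascomycete fungi"),
  gbif = list(phylum = "Ascomycota"),
  itis = list()  # no comparable filter
)

# Build the argument list for an ID lookup, always taking the first row.
id_lookup_args <- function(name, db, filters = fungus_filters) {
  if (!db %in% names(filters)) stop("no filters defined for db: ", db)
  c(list(name, rows = 1), filters[[db]])
}

ncbi_args <- id_lookup_args("Asterina", "ncbi")
str(ncbi_args)

# With taxize installed and network access, the list could be passed on, e.g.:
# uid <- do.call(taxize::get_uid, id_lookup_args("Asterina", "ncbi"))
# gbif_id <- do.call(taxize::get_gbifid, id_lookup_args("Asterina", "gbif"))
```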
Yea, I have been wanting to go through and make them all more consistent. Ideally even combine all the get_* functions into a single get_id_from_name function or something like that. In the meantime, I added the ... to classification for NCBI like you suggested:
library(taxize)
taxize::classification('Asterina', division_filter = "ascomycete fungi", db = "ncbi")
#> ══ 1 queries ═══════════════
#>
#> Retrieving data for taxon 'Asterina'
#> ✔ Found: Asterina
#> ══ Results ═════════════════
#>
#> • Total: 1
#> • Found: 1
#> • Not Found: 0
#> $Asterina
#>                              name         rank      id
#> 1              cellular organisms      no rank  131567
#> 2                       Eukaryota superkingdom    2759
#> 3                    Opisthokonta        clade   33154
#> 4                           Fungi      kingdom    4751
#> 5                         Dikarya   subkingdom  451864
#> 6                      Ascomycota       phylum    4890
#> 7                  saccharomyceta        clade  716545
#> 8                  Pezizomycotina    subphylum  147538
#> 9                    leotiomyceta        clade  716546
#> 10                 dothideomyceta        clade  715962
#> 11                Dothideomycetes        class  147541
#> 12 Dothideomycetes incertae sedis      no rank  159987
#> 13                    Asterinales        order 1619909
#> 14                   Asterinaceae       family  281108
#> 15                       Asterina        genus  859380
#>
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"
Created on 2023-03-09 with reprex v2.0.2
I am trying to run classification() in a loop on a computing cluster to look up lineage info for a large list of species-ish names. Right now I am getting an error on the cluster that, based on all of my debugging, has to do with the entries that return multiple hits and then pop up a prompt to enter the number of the hit you want to select. Is it possible to have classification() automatically pick the first hit when multiple hits are found? Or do you have another suggestion for bypassing this issue when running automatically on a cluster? Thanks!