ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases
Other
30 stars 7 forks source link

downstream possible bug #44

Closed sckott closed 4 years ago

sckott commented 5 years ago

via https://github.com/ropensci/taxize/issues/727

x <- downstream("Bacteria", db = "ncbi", downto="species")
#> Error in name2taxid(x[is_named], db = "ncbi") :
#>.   Some of the input names are ambiguous, try setting out_type to 'summary'

but also

x <- downstream("Bacteria", db = "ncbi", downto="species", out_type = "summary")
#> Error in name2taxid(x[is_named], db = "ncbi") :
#>.   Some of the input names are ambiguous, try setting out_type to 'summary'

not sure what problem is yet

arendsee commented 5 years ago

Hmm, the out_type argument does not do what the documentation says. These commands both produce the same output:

taxizedb::downstream("Brassica", db = "ncbi", downto="species", out_type = "uid")
taxizedb::downstream("Brassica", db = "ncbi", downto="species", out_type = "summary")
arendsee commented 5 years ago

OK, I think I know what is wrong. I'll fix it and add some new tests.

arendsee commented 5 years ago

The name2taxid function is called internally by downstream to convert the given vector of taxa names into taxonomy IDs. The output table is a named list containing tables of downsteam taxa. The list names exactly match the given names. For example, taxizedb::downstream("PiG") will return a list with the name "PiG" for the first (and only) element. The problem with names that map to more than one taxa -- like "Bacteria", which is both the Kingdom of bacteria and the Genus of walking sticks -- is that they would result in non-unique names in the list.

One solution would be to give up on making output list names match input names. The output could be a list with names corresponding to the unique taxonomy IDs. The indices would then not be stable, since the output list could be longer than the input list of names. This would also force everyone to use IDs, even when names would fine except in pathological cases.

Another "solution" is to not fix the problem at all, but just make a better error message. Then if any of the input names are ambiguous, taxizedb dies with a clear error message. The user would then have to resolve the ambiguities (possibly replacing the names with IDs themselves).

I'm inclined towards the death-on-ambiguity "solution".

@sckott thoughts?

PS. The reason we get the cryptic message about the "summary" option is because no arguments are actually passed internally to taxid2name, where the error actually is raised. So taxid2name always uses the "uid" value for out_type.

sckott commented 5 years ago

I agree that we should stop with useful message with ambiguity.

thanks on the ps

arendsee commented 5 years ago

OK, how does this look:

> downstream("Bacteria")
Error in ncbi_apply(src, x, FUN, missing = missing, ...) : 
  The following taxa map to multiple taxonomy ids: 
     Bacteria  -  2|629395
arendsee commented 5 years ago

I see taxize has a cool interactive feature where you can choose between ambiguous taxonomy IDs. Probably we should add that for taxizedb also at some point, though I don't have time to spare at the moment.

sckott commented 5 years ago

that change looks good

sckott commented 5 years ago

can you add a test for that

sckott commented 5 years ago

right, we do have a prompt feature when there's more than 1 option. although we do have a few escape hatches to make things easier for programmatic use:

arendsee commented 5 years ago

@sckott I added a little test that asserts an error is raised on ambiguous names.

The prompt handling in taxize all sounds legit. Maybe we can leave this issue open until we have something similar for taxizedb? I can't work on it right now, but I should have some time to spare mid-February (after my defense).

sckott commented 5 years ago

thanks for adding the test.

Okay, sounds good.