Closed widdowquinn closed 5 years ago
Currently the NCBI FASTA importer assumes "Genus species" is two words, https://github.com/peterjc/thapbi-pict/blob/a2daece9decc46404eb1cdc1091a1550ca0949e7/thapbi_pict/ncbi.py#L39
Currently the legacy FASTA importer assumes after removing the clade (first word) and accession (last word) we are left with the Genus and species but this does include some extra information sometimes:
Some might be legitimate multiword species, e.g. Phytophthora sp. aff.meadii but others do seem to have unwanted text.
For framing: why I think our focus needs to be at species level.
In pyani
I had a similar issue, extracting taxonomy data from genome assemblies with Bio.Entrez
. The bulk of the logic is in two functions:
https://github.com/widdowquinn/pyani/blob/development/pyani/download.py#L211
Here, we obtain the Entrez.esummary
of a specific assembly in the assembly
database, identified by UID; this contains the "SpeciesName"
field, which is in the format Genus species
.
I have not checked if an analogous process will work for the NCBI results when querying for ITS1 sequences.
For importing FASTA files, we can use the NCBI taxonomy dump (#9) to pre-load known Genus/species names. Or, do this with on the fly NCBI Entrez lookups.
Then during import as follows. e.g. Importing legacy entry Phytophthora_ilicis_voucher_P3939
we would match to genus Phytophthora, species, ilicis (which is NCBI taxid 89335), and also record the original slug Phytophthora_ilicis_voucher_P3939
for provenance. In this case there are no NCBI defined sub-entries.
TODO: Add new taxonomy table field for this original free text
My view is that the simplest routes are:
It is worth noting that a number of apparent species names should be considered examples of the species "unclassified Phytophthora", per this example. This causes potential issues for both our database and classifier.
We can modify our database with revised/more certain taxonomic assigments as time goes by, with updated versions of the tool.
OK, so species rank entries under "unclassified Phytophthora" would not get on our white-list - they would all appear under genus "Phytophthora", species "unclassified", which is NCBI taxid 211524
i.e. Rather than looking at all species rank entries, we probably do want to focus on the direct children of each genus of interest (initially just Phytophthora).
At a future date we'd likely do the same with Peronospora etc, where again there is a large unclassified group:
Confirming my comment from yesterday, as per work on #20, we get the desired species entries by taking the direct children of each genus of interest in Phytophthora. See d2a2471e18bf5fbee5ff9b55fb4a73b3e90cec54 and the example in the commit comment running at the Peronosporales level:
$ thapbi_pict load-tax -d sqlite:///:memory: -v -a 4776 -t new_taxdump_2018-12-01 > /dev/null
Loaded 2033846 nodes from nodes.dmp
Loaded 2033846 scientific names from names.dmp
Identified 24 genera under specified ancestor node: Baobabopsis, Basidiophora, Benua, Bremia, Calycofera, Crypticola, Eraphthora, Graminivora, Hyaloperonospora, Nothophytophthora, Novotelnova, Paraperonospora, Perofascia, Peronosclerospora, Peronospora, Phytophthora, Plasmopara, Plasmoverna, Protobremia, Pseudoperonospora, Salisapilia, Sclerophthora, Sclerospora, Viennotia
WARNING: Treating no rank 'unclassified Phytophthora' (txid211524) as a species.
WARNING: Treating no rank 'unclassified Peronospora' (txid376156) as a species.
WARNING: Treating no rank 'Hyaloperonospora parasitica species group' (txid453155) as a species.
WARNING: Treating no rank 'environmental samples' (txid660914) as a species.
WARNING: Treating no rank 'unclassified Pseudoperonospora' (txid1000571) as a species.
WARNING: Treating no rank 'unclassified Bremia' (txid1697568) as a species.
Filtered down to 580 species names
That works as intended for the unclassified groups, and in the absence of an "unclassified Hyaloperonospora" entry, seems reasonable for the Hyaloperonospora cases too ("Hyaloperonospora parasitica species group" and "environmental samples"),
A couple of interesting corner cases from validating the species names in the NCBI FASTA import test,
The simple plan to pre-load current NCBI taxonomy names, and use that for validation of an import, simply will not cope with these and other synonyms.
For NCBI searches, the current approach of running the Entrez search at the command line and capturing it as FASTA, then importing that, is too limited. Here we could call Entrez direct as a future enhancement.
It may be that the taxonomy dump is not the best way to compile the whitelist.
The NCBI taxonomy database does have the synonym information, e.g.:
from Bio import Entrez
Entrez.email = "kenny.loggins@danger.zone"
handle = Entrez.efetch(db="taxonomy", id="127444") . # P. undulata
record = Entrez.read(handle)[0]
print(record["OtherNames"]["Synonym"])
which gives:
['Elongisporangium undulatum', 'Pythium undulata', 'Pythium undulatum']
Now, we could in principle identify synonyms for each taxID in the dump, and create a dictionary mapping each synonym onto a single taxID and preferred scientific name (record["ScientificName"]
in the above). We will, because that's how taxonomy is, encounter cases where a synonym is not a 1:1 mapping to taxID/scientific name. But, at least in that case we can explain exactly why the original record is ambiguous and that we should not be incorporating it into a reference database if we can't even tell what it's meant to be…
I think that
is a workable approach.
Moved synonyms to issue #21, this looks like a minor corner case with the current datasets.
Merged #22 which gets a long way towards closing this.
Still need to tweak handling of unmatched sequences.
One option is to fall back to any genus level unclassified species entry like taxid 211524, unclassified Phytophthora, with a final fall back of taxid 12908, unclassified sequences (e.g. don't know the genus, or the genus has not got an unclassified entry).
Leighton's suggestion to simply not import those sequences which fail taxonomic matching (e.g. species name not in the NCBI yet, or falls under the unclassified node so we don't trust it) would be simpler to implement.
An advantage of not importing the sequences whose taxonomy we don't recognise is that it puts the responsibility for classification on the scientist using the tool.
Taxonomy is sometimes a fraught topic, with many grey area and uncertainties (often not easy to codify logically), and the more we "force" the scientists to be clear and specific with their inputs, rather than have an imperfect tool take its "best guess", the happier I think we'll all be with the output.
OK, in species validation mode the FASTA import has been updated on #23 to ignore/discard ITS1 entries where the species name was not already in the DB.
We've still got #21 open as an enhancement for dealing with synonyms, and #11 for a direct NCBI search-and-import command.
For accurate identification of sequence sources (e.g. when extracting a database for "all cinnamomi") we need to be able to identify all sequences that "belong" to "cinnamomi".
With a simple string search, this is possible if the
species
field for a sequence is or contains "cinnamomi" (but not all occurrences of "cinnamomi" may be a reliable indicator that the sequence is a Phytophthora cinnamomi) - however, accession or pathovar information is not part of the species name, and should not strictly be present in thespecies
field. We should split species name from access/pathovar etc. wherever possible.With a NCBI Taxonomy ID search, we can unambiguously identify whether the recorded taxonomy ID is part of the subtree with a node at a required species. For example, a search for Phytophthora cinnamomi could request taxon ID 4785. This is a parent to four subnodes (it is actually grandparent to one of these), as can be seen here.
In this instance, for sequences deriving from these recorded organisms, we could record:
The uniting features when we want to pull a sub-database at species level (or use species-level concordance to estimate sequence reliability) are "Species: cinnamomi" or "taxon ID is in subtree of 4785". The first of these is easier to implement as a preliminary filter (though identifying subtrees is not hard once the taxon ID tree is loaded).
A related question is what to do with the variant/accession information. At the moment, my view is that this is extraneous information that should be kept. Anything after the species name is of relatively minor importance for our classifier beyond identifying the originating organism for a sequence.
It is not always obvious how to parse "species" specifically out of binomial nomenclature, especially for hand-maintained files. One approach that may be useful is:
Alternatively, taking the first word that follows the genus name, and assuming that binomial nomenclature holds, is a good first pass.
Neither of these two approaches is robust to typos.