Clean up species names; `species` column should contain only species name

widdowquinn commented 5 years ago

For accurate identification of sequence sources (e.g. when extracting a database for "all cinnamomi") we need to be able to identify all sequences that "belong" to "cinnamomi".

With a simple string search, this is possible if the species field for a sequence is or contains "cinnamomi" (but not all occurrences of "cinnamomi" may be a reliable indicator that the sequence is a Phytophthora cinnamomi) - however, accession or pathovar information is not part of the species name, and should not strictly be present in the species field. We should split species name from access/pathovar etc. wherever possible.

With a NCBI Taxonomy ID search, we can unambiguously identify whether the recorded taxonomy ID is part of the subtree with a node at a required species. For example, a search for Phytophthora cinnamomi could request taxon ID 4785. This is a parent to four subnodes (it is actually grandparent to one of these), as can be seen here.

In this instance, for sequences deriving from these recorded organisms, we could record:

Genus: Phytophthora; Species: cinnamomi; taxon ID: 4785
Genus: Phytophthora; Species: cinnamomi; variant: var. cinnamomi; taxon ID: 622258
Genus: Phytophthora; Species: cinnamomi; variant: var. cinnamomi; accession: CBS 144.22; taxon ID: 1048749
Genus: Phytophthora; Species: cinnamomi; variant: var. parvispora; taxon ID: 134458
Genus: Phytophthora; Species: cinnamomi; variant: var. robiniae; taxon ID: 708041

The uniting features when we want to pull a sub-database at species level (or use species-level concordance to estimate sequence reliability) are "Species: cinnamomi" or "taxon ID is in subtree of 4785". The first of these is easier to implement as a preliminary filter (though identifying subtrees is not hard once the taxon ID tree is loaded).

A related question is what to do with the variant/accession information. At the moment, my view is that this is extraneous information that should be kept. Anything after the species name is of relatively minor importance for our classifier beyond identifying the originating organism for a sequence.

It is not always obvious how to parse "species" specifically out of binomial nomenclature, especially for hand-maintained files. One approach that may be useful is:

Identify parent (e.g. Phytophthora) genus ID
Identify all first-level children of this parent: these are (mostly) species
Make a list of all the first-level child species names
Use this list as a check/validation/query to identify the likely species name (possible rule: genus name followed by a valid species name from this list)

Alternatively, taking the first word that follows the genus name, and assuming that binomial nomenclature holds, is a good first pass.

Neither of these two approaches is robust to typos.

peterjc commented 5 years ago

Currently the NCBI FASTA importer assumes "Genus species" is two words, https://github.com/peterjc/thapbi-pict/blob/a2daece9decc46404eb1cdc1091a1550ca0949e7/thapbi_pict/ncbi.py#L39

Currently the legacy FASTA importer assumes after removing the clade (first word) and accession (last word) we are left with the Genus and species but this does include some extra information sometimes:

https://github.com/peterjc/thapbi-pict/blob/a2daece9decc46404eb1cdc1091a1550ca0949e7/thapbi_pict/legacy.py#L150

Some might be legitimate multiword species, e.g. Phytophthora sp. aff.meadii but others do seem to have unwanted text.

widdowquinn commented 5 years ago

For framing: why I think our focus needs to be at species level.

we are tasked with providing a classifier that is able to identify the presence of distinct Phytophthora species. We do not need to track at sub-species level.
legislatively, our targets are defined as species of interest, and not at a lower level.

widdowquinn commented 5 years ago

In pyani I had a similar issue, extracting taxonomy data from genome assemblies with Bio.Entrez. The bulk of the logic is in two functions:

https://github.com/widdowquinn/pyani/blob/f44b040810fb2094191572f8b1fc97c14262cbe2/pyani/download.py#L193

https://github.com/widdowquinn/pyani/blob/development/pyani/download.py#L211

Here, we obtain the Entrez.esummary of a specific assembly in the assembly database, identified by UID; this contains the "SpeciesName" field, which is in the format Genus species.

I have not checked if an analogous process will work for the NCBI results when querying for ITS1 sequences.

peterjc commented 5 years ago

For importing FASTA files, we can use the NCBI taxonomy dump (#9) to pre-load known Genus/species names. Or, do this with on the fly NCBI Entrez lookups.

Then during import as follows. e.g. Importing legacy entry Phytophthora_ilicis_voucher_P3939 we would match to genus Phytophthora, species, ilicis (which is NCBI taxid 89335), and also record the original slug Phytophthora_ilicis_voucher_P3939 for provenance. In this case there are no NCBI defined sub-entries.

TODO: Add new taxonomy table field for this original free text

widdowquinn commented 5 years ago

My view is that the simplest routes are:

for FASTA files, test the binomial against a pre-approved NCBI taxonomy whitelist
for direct NCBI downloads, interrogate the NCBI database for the appropriate field

widdowquinn commented 5 years ago

It is worth noting that a number of apparent species names should be considered examples of the species "unclassified Phytophthora", per this example. This causes potential issues for both our database and classifier.

species assignments under "unclassified Phytophthora" are tentative and essentially self-declare as unreliable classifications. We should not use these when generating definitive databases, e.g. for a species (though for Phytophthora as a genus they're probably safe)
as "unclassified Phytophthora" is the "species name" (though it does not have species rank in the NCBI taxonomy), we will need to either:
- check against the 'grandchildren names'
- treat anything not at the species level in the whitelist as "unclassified Phytophthora"
when training or validating the classifier, we should exclude these sequences, and assume we do not know their true class.

We can modify our database with revised/more certain taxonomic assigments as time goes by, with updated versions of the tool.

peterjc commented 5 years ago

OK, so species rank entries under "unclassified Phytophthora" would not get on our white-list - they would all appear under genus "Phytophthora", species "unclassified", which is NCBI taxid 211524

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=4777&lvl=3&lin=f&keep=1&srchmode=1&unlock

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Tree&id=211524&lvl=3&lin=f&keep=1&srchmode=1&unlock

i.e. Rather than looking at all species rank entries, we probably do want to focus on the direct children of each genus of interest (initially just Phytophthora).

At a future date we'd likely do the same with Peronospora etc, where again there is a large unclassified group:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Tree&id=376156&lvl=3&lin=f&keep=1&srchmode=1&unlock

peterjc commented 5 years ago

Confirming my comment from yesterday, as per work on #20, we get the desired species entries by taking the direct children of each genus of interest in Phytophthora. See d2a2471e18bf5fbee5ff9b55fb4a73b3e90cec54 and the example in the commit comment running at the Peronosporales level:

$ thapbi_pict load-tax -d sqlite:///:memory: -v -a 4776 -t new_taxdump_2018-12-01 > /dev/null
Loaded 2033846 nodes from nodes.dmp
Loaded 2033846 scientific names from names.dmp
Identified 24 genera under specified ancestor node: Baobabopsis, Basidiophora, Benua, Bremia, Calycofera, Crypticola, Eraphthora, Graminivora, Hyaloperonospora, Nothophytophthora, Novotelnova, Paraperonospora, Perofascia, Peronosclerospora, Peronospora, Phytophthora, Plasmopara, Plasmoverna, Protobremia, Pseudoperonospora, Salisapilia, Sclerophthora, Sclerospora, Viennotia
WARNING: Treating no rank 'unclassified Phytophthora' (txid211524) as a species.
WARNING: Treating no rank 'unclassified Peronospora' (txid376156) as a species.
WARNING: Treating no rank 'Hyaloperonospora parasitica species group' (txid453155) as a species.
WARNING: Treating no rank 'environmental samples' (txid660914) as a species.
WARNING: Treating no rank 'unclassified Pseudoperonospora' (txid1000571) as a species.
WARNING: Treating no rank 'unclassified Bremia' (txid1697568) as a species.
Filtered down to 580 species names

That works as intended for the unclassified groups, and in the absence of an "unclassified Hyaloperonospora" entry, seems reasonable for the Hyaloperonospora cases too ("Hyaloperonospora parasitica species group" and "environmental samples"),

peterjc commented 5 years ago

A couple of interesting corner cases from validating the species names in the NCBI FASTA import test,

Pythium undulatum aka Phytophthora undulata https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=127444&lvl=3&lin=f&keep=1&srchmode=1&unlock
Peronophythora litchii aka Phytophthora litchii https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=100870&lvl=3&lin=f&keep=1&srchmode=1&unlock

The simple plan to pre-load current NCBI taxonomy names, and use that for validation of an import, simply will not cope with these and other synonyms.

For NCBI searches, the current approach of running the Entrez search at the command line and capturing it as FASTA, then importing that, is too limited. Here we could call Entrez direct as a future enhancement.

widdowquinn commented 5 years ago

It may be that the taxonomy dump is not the best way to compile the whitelist.

The NCBI taxonomy database does have the synonym information, e.g.:

from Bio import Entrez

Entrez.email = "kenny.loggins@danger.zone"

handle = Entrez.efetch(db="taxonomy", id="127444") . # P. undulata
record = Entrez.read(handle)[0]
print(record["OtherNames"]["Synonym"])

which gives:

['Elongisporangium undulatum', 'Pythium undulata', 'Pythium undulatum']

Now, we could in principle identify synonyms for each taxID in the dump, and create a dictionary mapping each synonym onto a single taxID and preferred scientific name (record["ScientificName"] in the above). We will, because that's how taxonomy is, encounter cases where a synonym is not a 1:1 mapping to taxID/scientific name. But, at least in that case we can explain exactly why the original record is ambiguous and that we should not be incorporating it into a reference database if we can't even tell what it's meant to be…

I think that

obtain taxdump
extract taxIDs of interest from taxdump
create dictionary mapping scientific names and synonyms to (taxID, scientific name) pairs where unambiguous mapping is possible
compare candidate sequences to keys in the dictionary

is a workable approach.

peterjc commented 5 years ago

Moved synonyms to issue #21, this looks like a minor corner case with the current datasets.

peterjc commented 5 years ago

Merged #22 which gets a long way towards closing this.

Still need to tweak handling of unmatched sequences.

One option is to fall back to any genus level unclassified species entry like taxid 211524, unclassified Phytophthora, with a final fall back of taxid 12908, unclassified sequences (e.g. don't know the genus, or the genus has not got an unclassified entry).

Leighton's suggestion to simply not import those sequences which fail taxonomic matching (e.g. species name not in the NCBI yet, or falls under the unclassified node so we don't trust it) would be simpler to implement.

widdowquinn commented 5 years ago

An advantage of not importing the sequences whose taxonomy we don't recognise is that it puts the responsibility for classification on the scientist using the tool.

Taxonomy is sometimes a fraught topic, with many grey area and uncertainties (often not easy to codify logically), and the more we "force" the scientists to be clear and specific with their inputs, rather than have an imperfect tool take its "best guess", the happier I think we'll all be with the output.

peterjc commented 5 years ago

OK, in species validation mode the FASTA import has been updated on #23 to ignore/discard ITS1 entries where the species name was not already in the DB.

We've still got #21 open as an enhancement for dealing with synonyms, and #11 for a direct NCBI search-and-import command.

peterjc / thapbi-pict

Clean up species names; `species` column should contain only species name #17