ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

tracking lineage on wikidata db #79

Closed mdrishti closed 3 months ago

mdrishti commented 3 months ago

Hi,

I am trying to track the lineage of several wikidata taxon ids. The idea is to pick the ids that map to one kingdom (e.g. Plantae). I found that the classification function does not support the wikidata db, so I decided to use a graph traversal approach to track the lineage using wikidata_id and parent_id. However, it turns out that some wikidata_id values have several parent_id values, which makes it difficult to do the graph traversal efficiently. Do you have any ideas on how I can track the lineage, or is there an obvious approach that I am missing here?

regards, DT

stitam commented 3 months ago

Thanks @mdrishti for raising this issue. Currently the classification function is only implemented for col, gbif, itis, ncbi and wfo, so unfortunately not for wikidata. taxizedb downloads the database from Zenodo and converts it to SQLite, so my guess is that this is an issue with the original publication: https://zenodo.org/record/1213477. Can you please provide an example where you see multiple parents?

mdrishti commented 3 months ago

Hi @stitam,

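Following is a reproducible example:

library(taxizedb)
library(dplyr)  # provides tbl()

db_download_wikidata(verbose = TRUE, overwrite = FALSE)
src <- src_wikidata()
df.wikidata <- data.frame(tbl(src, "wikidata"))
# rows whose parent_id contains a literal "|", i.e. more than one parent
head(df.wikidata[grep("\\|", df.wikidata$parent_id), ])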

Gives me the following:

| wikidata_id | scientific_name | rank_id | parent_id |
| -- | -- | -- | -- |
| Q13060358 | Hattoria | Q34740 | Q1713397 \| Q186577 |
| Q13076716 | Nothotsuga | Q34740 | Q2821616 \| Q101680 |
| Q13169693 | Juliformia | Q5868144 | Q19891064 \| Q21061214 |
| Q13403032 | Sardinops | Q34740 | Q2110160 \| Q27141 |
| Q13418100 | Aphloiaceae | Q35409 | Q338878 \| Q902665 \| Q21860 |

I think there are 3883 such rows.
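For anyone who wants to work with these rows as graph edges, here is a minimal base R sketch that splits the pipe-separated parent_id values into one row per (child, parent) pair. The sample data frame is copied from the output above; the exact spacing around the `|` separator in the real table is an assumption, so the split pattern tolerates both forms. The names `wd`, `edges` and `n_parents` are only for illustration.

```r
# Sample rows copied from the output above; the real table would come from
# tbl(src, "wikidata"). Spacing around "|" is an assumption.
wd <- data.frame(
  wikidata_id = c("Q13060358", "Q13076716", "Q13169693",
                  "Q13403032", "Q13418100"),
  scientific_name = c("Hattoria", "Nothotsuga", "Juliformia",
                      "Sardinops", "Aphloiaceae"),
  parent_id = c("Q1713397 | Q186577", "Q2821616 | Q101680",
                "Q19891064 | Q21061214", "Q2110160 | Q27141",
                "Q338878 | Q902665 | Q21860"),
  stringsAsFactors = FALSE
)

# Split on "|" with optional surrounding whitespace,
# so both "a|b" and "a | b" are handled.
parents <- strsplit(wd$parent_id, "\\s*\\|\\s*")

# Long format: one row per (child, parent) edge.
edges <- data.frame(
  child = rep(wd$wikidata_id, lengths(parents)),
  parent = unlist(parents),
  stringsAsFactors = FALSE
)

# Number of parents recorded for each taxon.
n_parents <- table(edges$child)
```

An edge list like this can be fed to any traversal that follows every parent instead of assuming a single one.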

-DT

stitam commented 3 months ago

Thanks @mdrishti. I downloaded the original database and looked into the tsv. It seems these multi-parent instances come from the original database, so this is not a taxizedb thing.

I looked up the ids in the first row manually: Hattoria (Q13060358) is a genus and according to wikidata its parent taxa are both (!) Jungermanniaceae (Q1713397) (family) and Scapaniaceae (Q186577) (family).

For comparison, I looked up Hattoria in other databases:

NCBI:

taxizedb::classification(taxizedb::name2taxid("Hattoria"), db = "ncbi")
#> $`984535`
#>                  name         rank      id
#> 1  cellular organisms      no rank  131567
#> 2           Eukaryota superkingdom    2759
#> 3       Viridiplantae      kingdom   33090
#> 4        Streptophyta       phylum   35493
#> 5      Streptophytina    subphylum  131221
#> 6         Embryophyta        clade    3193
#> 7     Marchantiophyta        clade    3195
#> 8   Jungermanniopsida        class  186771
#> 9     Jungermanniidae     subclass  186782
#> 10    Jungermanniales        order    3199
#> 11     Cephaloziineae     suborder   71154
#> 12  Anastrophyllaceae       family 1131839
#> 13           Hattoria        genus  984535
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "ncbi"

Created on 2024-05-27 with reprex v2.1.0

GBIF:

taxizedb::classification(5286352, db = "gbif")
#> $`5286352`
#>                   name    rank      id
#> 1              Plantae kingdom       6
#> 2      Marchantiophyta  phylum       9
#> 3    Jungermanniopsida   class     126
#> 4      Jungermanniales   order     381
#> 5    Anastrophyllaceae  family    2292
#> 6 Hattoria R.M.Schust.   genus 5286352
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "gbif"

ITIS:

taxizedb::classification(14683, db = "itis")
#> $`14683`
#>                 name          rank      id
#> 1            Plantae       kingdom  202422
#> 2      Viridiplantae    subkingdom  954898
#> 3       Streptophyta  infrakingdom  846494
#> 4        Embryophyta superdivision  954900
#> 5    Marchantiophyta      division  846119
#> 6  Jungermanniopsida         class  846124
#> 7    Jungermanniidae      subclass   14198
#> 8    Jungermanniales         order   14210
#> 9     Cephaloziineae      suborder  846191
#> 10 Anastrophyllaceae        family 1107780
#> 11          Hattoria         genus   14683
#> 
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "itis"

These three agree with each other, but none of them mentions either Jungermanniaceae or Scapaniaceae from wikidata. My guess is that a taxon can belong to multiple taxonomy schemes: some providers like GBIF use a single scheme, so each taxon has a single parent, while other providers like wikidata may collect data from multiple schemes, which can lead to ambiguity. Does this make sense?
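If the wikidata table is still the preferred source, one possible way to cope with multiple parents is to treat the lineage as a directed graph rather than a single chain and collect every ancestor breadth-first. The sketch below is not part of taxizedb; `all_ancestors()` is a hypothetical helper, and every id starting with `HYP_` is made up for illustration (only the Hattoria and Sardinops ids and their direct parents come from the rows reported above).

```r
# Collect every ancestor of `id` from a child -> parent edge list,
# following all parents when a taxon has more than one.
all_ancestors <- function(id, edges) {
  seen <- character(0)
  queue <- id
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    parents <- edges$parent[edges$child == current]
    parents <- parents[!is.na(parents)]
    new <- setdiff(parents, seen)  # skip visited nodes (also guards against cycles)
    seen <- c(seen, new)
    queue <- c(queue, new)
  }
  seen
}

# Toy edge list. HYP_* ids are hypothetical placeholders, not real wikidata ids.
edges <- data.frame(
  child = c("Q13060358", "Q13060358", "Q1713397", "Q186577", "HYP_order",
            "Q13403032", "Q13403032", "Q2110160", "Q27141"),
  parent = c("Q1713397", "Q186577", "HYP_order", "HYP_order", "HYP_plantae",
             "Q2110160", "Q27141", "HYP_animalia", "HYP_animalia"),
  stringsAsFactors = FALSE
)

# Does each taxon have the (hypothetical) plant kingdom among its ancestors?
ids <- c("Q13060358", "Q13403032")
in_plants <- vapply(
  ids,
  function(x) "HYP_plantae" %in% all_ancestors(x, edges),
  logical(1)
)
```

Note that with conflicting schemes a taxon could in principle reach more than one kingdom this way, so the result answers "is this kingdom among the ancestors" rather than "what is the single lineage".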

The term Scapaniaceae seems to come from "Phylogeny and classification of the Marchantiophyta", and it seems Jungermanniaceae does not even have a reference. If you want to get to the bottom of this, maybe you could reach out to some wikidata taxon authors?

Fun fact: when I look at the Wikipedia page for Hattoria (https://en.wikipedia.org/wiki/Hattoria), I see the same taxonomy that taxizedb extracts from GBIF. I wonder if you could use GBIF instead of wikidata?
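As a rough sketch of the GBIF route, the kingdom can be read off each classification data frame and used to keep only the ids of interest. `kingdom_of()` is a hypothetical helper, not a taxizedb function, and the `res` object below is mocked from the GBIF output shown above so that the example runs without a downloaded database.

```r
# Return the kingdom name from one classification data frame
# (columns name, rank, id, as in the outputs above), or NA if absent.
kingdom_of <- function(cl) {
  if (!is.data.frame(cl)) return(NA_character_)
  k <- cl$name[cl$rank == "kingdom"]
  if (length(k) == 0) NA_character_ else k[1]
}

# With the GBIF database downloaded, the input would come from e.g.
#   res <- taxizedb::classification(5286352, db = "gbif")
# Here it is mocked from the output shown above.
res <- list(
  `5286352` = data.frame(
    name = c("Plantae", "Marchantiophyta", "Jungermanniopsida",
             "Jungermanniales", "Anastrophyllaceae", "Hattoria R.M.Schust."),
    rank = c("kingdom", "phylum", "class", "order", "family", "genus"),
    id = c("6", "9", "126", "381", "2292", "5286352"),
    stringsAsFactors = FALSE
  )
)

# Kingdom per queried id, then keep the ids that fall under Plantae.
kingdoms <- vapply(res, kingdom_of, character(1))
plant_ids <- names(kingdoms)[!is.na(kingdoms) & kingdoms == "Plantae"]
```

The `is.data.frame()` check is there so that ids without a classification result are reported as NA instead of raising an error.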

mdrishti commented 3 months ago

Thanks @stitam. Yes, using GBIF was my last resort, mainly because there are more species in wikidata than in GBIF. But I guess that is what I will use for the short term. Thanks for pointing to the original Zenodo repo; it will be useful for me in the long term.

Closing this issue now.

-DT