Open cboettig opened 6 years ago
@cboettig thanks for sharing. Your comments highlight various separate issues. I'll attempt to address each of them separately in the following comments. I am planning to release a new GloBI Taxon Graph version v0.3.2 with corrections applied in this thread.
First, in line 98457 in taxonCache.tsv v0.3.1, I found (please note that header was added for convenience)
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
EOL:224784 Neoniphon sammara Species "Kolvin-soldaat @af | Deek @ar | 鐵甲 @cnm | Eichhörnchenfisch @de | Sammara squirrelfish @en | Candil samara @es | Corocoro @fj | Marignan tacheté @fr | \"Ala'ihi @hw | Ukeguchi-ittoudai @ja | 무늬얼게돔 @ko | Jerra @mh | Kolithaduva @ml | Kinolu @ms | Esquilo samara @pt | Malau-tui @sm | Baga-baga @tl | Araoe @ty | Cá Son dá dài @vi | 条纹长颏鳂 @zh | 莎姆新東洋金鱗魚 @zh-Hant |" Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784 kingdom | phylum | class | order | family | genus | species http://eol.org/pages/224784 http://media.eol.org/content/2009/05/19/16/85885_98_68.jpg
Note that the commonNames value is (incorrectly) enclosed in double quotes and contains an escaped double quote in `\"Ala'ihi @hw`.
On closer inspection, the commonNames value was enclosed in quotes when csv was still used to store taxonCache. This also explains the escaped double quote. Also, it appears that the Hawaiian name for Neoniphon sammara is not transcribed properly in EOL http://eol.org/pages/224784/names/common_names . Instead of `"Ala'ihi`, I suspect the name should be `'ala'ihi`, replacing the double quote with a single quote.
@jhammock any chance you can update the common name? From sources like http://www.wpcouncil.org/managed-fishery-ecosystems/hawaii-archipelago/regulations-and-enforcement-hawaii/ it appears that the common name is used to describe various different species, not just Neoniphon sammara .
To correct for this, the enclosing double quotes are removed and the escaped double quote has been replaced with the original string reported by EOL, including its double quote. Note that TSV does not need escaping of quotes (https://www.iana.org/assignments/media-types/text/tab-separated-values) .
A second issue was reported on line 119858:
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
EOL:392765 Handroanthus chrysanthus Species "roble amarillo @en | \"makulis\" @es | เหลืองอินเดีย @th |" Plantae | Tracheophyta | Magnoliopsida | Lamiales | Bignoniaceae | Handroanthus | Handroanthus chrysanthus EOL:281 | EOL:4077 | EOL:283 | EOL:4300 | EOL:4421 | EOL:27931337 | EOL:392765 kingdom | phylum | class | order | family | genus | species http://eol.org/pages/392765 http://media.eol.org/content/2015/02/26/03/48029_98_68.jpg
A similar pattern is observed here: csv-style escaping/quoting was used because the text contains double quotes.
@jhammock any idea why `makulis` for the Spanish common name on http://eol.org/pages/392765/names/common_names is surrounded by double quotes?
To correct, double quotes are removed as well as the escaped double quotes.
A third issue was reported on line 425504:
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
INAT_TAXON:379688 candidatus phytoplasma genus "Bacteria | Firmicutes | Mollicutes | \"candidatus phytoplasma\"" INAT_TAXON:67333 | INAT_TAXON:151853 | INAT_TAXON:151986 | INAT_TAXON:379688 kingdom | phylum | class | genus http://inaturalist.org/taxa/379688
Same double quoting issues here. Integration tests confirm that iNaturalist explicitly reports `"candidatus phytoplasma"` for the genus.
To correct, enclosing double quotes are removed as well as the escape characters.
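The three corrections above share one pattern: drop the enclosing double quotes and turn each escaped `\"` back into a plain `"`. A minimal base R sketch of that cleanup follows; `unquote_csv_field` is a hypothetical helper written for illustration, not something GloBI or nomer provides, and the sample value is a shortened version of the Handroanthus chrysanthus entry.

```r
# Hypothetical helper (not part of GloBI or nomer): undo CSV-style quoting
# that was left behind in a TSV field.
unquote_csv_field <- function(x) {
  # Only touch values that are fully enclosed in double quotes.
  quoted <- !is.na(x) & grepl('^".*"$', x)
  # Drop the enclosing double quotes ...
  x[quoted] <- substr(x[quoted], 2, nchar(x[quoted]) - 1)
  # ... and turn each escaped \" back into a plain ".
  x[quoted] <- gsub('\\"', '"', x[quoted], fixed = TRUE)
  x
}

field <- '"roble amarillo @en | \\"makulis\\" @es |"'
cleaned <- unquote_csv_field(field)
cat(cleaned, "\n")
```

Values that are not fully enclosed in double quotes, and missing values, are returned unchanged.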
A fourth issue was found, where entries in taxonCache were found without a taxonId column. This was a transformation mistake and entries with missing taxonId columns will be removed. Note that the entries without an id actually had valid counterparts in the taxonCache file.
Also, please note that the first three issues are definitely data errors, but not tsv parsing errors. TSV, according to IANA https://www.iana.org/assignments/media-types/text/tab-separated-values , does not have any string quoting . Please see https://github.com/tidyverse/readr/issues/844 .
If an empty quote parameter is used, no problems are encountered when reading the taxonCache.tsv :
taxonCache <- readr::read_tsv('taxonCache.tsv', quote='')
Parsed with column specification:
cols(
id = col_character(),
name = col_character(),
rank = col_character(),
commonNames = col_character(),
path = col_character(),
pathIds = col_character(),
pathNames = col_character(),
externalUrl = col_character(),
thumbnailUrl = col_character()
)
|=================================================================| 100% 904 MB
> library(readr)
> problems(taxonCache)
# tibble [0 × 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
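The same idea can be tried on a small scale with base R. The sketch below uses a made-up three-line TSV (the `X:` ids are placeholders, not real taxonCache ids) with an unbalanced double quote in commonNames, and reads it with `quote = ""` so that the quote character is kept as literal text.

```r
# Toy TSV (made-up ids) with an unbalanced double quote in commonNames.
tsv <- paste(
  "id\tname\tcommonNames",
  "X:1\tNeoniphon sammara\t\"Ala'ihi @hw",
  "X:2\tHandroanthus chrysanthus\troble amarillo @en",
  sep = "\n"
)

# quote = "" disables quote handling, so the stray quote stays literal
# and every line is parsed as its own row.
toy <- read.delim(text = tsv, quote = "", stringsAsFactors = FALSE)
print(toy)
```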
@cboettig curious to hear your thoughts on all this.
I've prepared a pre-release of taxonCache with applied changes, please see https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Please let me know if this pre-release solves this issue. If not, or if you find new issues, please do share.
Thanks, will do! Good point on the `tsv` by the way; makes total sense. The whole escaped quoting thing in `csv` files always bugged me, so `tsv` is a pretty clever solution I never properly appreciated (since it's harder to imagine needing a literal `\t` in a text file, but easy to see why you need a literal `,`).
Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?
I'm playing a bit with parsing the pipe strings right now; I see their utility but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings. Will let you know if that surfaces any other parsing issues for me.
> Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?
I did some spot checks, and duplicates seem to exist. I removed the entries with path values that include the unexpected `:` delimited values.
> I see their utility but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings.
I agree that zipping (combining) path / pathIds / pathNames is not convenient. It seems that most biologists are comfortable with tabular formats, so I am trying to figure out ways to mold data into that shape to lower the barrier to edit / use / share without losing too much flexibility. Am open to suggestions and am in favor of exposing the same knowledge in different formats rather than taking a one-size-fits-all approach.
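One way to expose the same knowledge in a different shape is to unzip the three pipe strings into one row per level of the hierarchy. The base R sketch below does this for the Neoniphon sammara entry quoted earlier in this thread; `split_pipes` and `path_to_long` are hypothetical helpers for illustration, and the sketch assumes the three strings are well aligned and contain no empty elements.

```r
# Split one pipe-delimited string into trimmed elements.
split_pipes <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Hypothetical helper: one row per level of the hierarchy.
path_to_long <- function(id, path, pathIds, pathNames) {
  name <- split_pipes(path)
  taxon_id <- split_pipes(pathIds)
  rank <- split_pipes(pathNames)
  # Only well-aligned entries can be zipped safely.
  stopifnot(length(name) == length(taxon_id), length(name) == length(rank))
  data.frame(id = id, rank = rank, name = name, taxonId = taxon_id,
             stringsAsFactors = FALSE)
}

long <- path_to_long(
  id = "EOL:224784",
  path = "Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara",
  pathIds = "EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784",
  pathNames = "kingdom | phylum | class | order | family | genus | species"
)
print(long)
```

In this long form the relationship between rank, name, and id is explicit per row, at the cost of repeating the entry id for every level.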
@jhpoelen I think I'm still seeing a whole bunch of entries with alignment issues?
library(tidyverse)
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache %>% filter(!grepl("(:|-|_)", id))
shows a bunch of rows that are getting parsed that appear to have no id and so still have everything misaligned.
@cboettig confirmed. I've uploaded a second pass at the taxonCache.tsv.gz file, overwriting https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Thanks for sharing, please check and let me know if you see more issues.
@jhpoelen I seem to be getting a 403 access denied error at that URL now(?)
Thanks for letting me know. I've updated the access privileges and the file should be public now. Please try again - https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .
@jhpoelen Thanks! Getting there! Looks like a possible data issue now:
e.g. row 243356 has a single entry in the path pipe-string but two entries in the pathNames pipe-string.
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache[243356,]$path
[1] "Gnaphalium purpureum"
> taxonCache[243356,]$pathNames
[1] "kingdom | species"
I see a total of 954 records where it looks to me that the number of pipes differs between `path` and `pathNames` (though I guess some of these might be `NA` for one or the other, which I guess is okay, but some clearly aren't, like the example above).
pattern <- "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>%
map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>%
map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
which( !(path_pipes == pathName_pipes))
Thanks again for your patience and feedback.
I went through the entries with mismatching path / path names. I found that most of the issues were due to a historic bug that didn't include empty ranks when ingesting path names. I removed the entries, after spot checking that duplicate entries existed in the taxonCache with aligned path/ids/names.
A single item, `EOL:211953 Cetengraulis edentulus`, appears to have a `\t` embedded in the common name `Anchoveta raboamaril\t3`. It appears that this common name was included in the taxonCache prior to the implementation of tab replacements on writing to tsv.
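For illustration, a tab replacement of this kind can be sketched in base R as below. `sanitize_tsv_field` is a hypothetical helper, and replacing with a single space is an assumption; the exact replacement GloBI applies on writing is not stated in this thread.

```r
# Hypothetical helper: replace tabs and line breaks inside a field with a
# single space so the value cannot introduce extra columns or rows.
# (A single space is an assumption, not GloBI's documented behaviour.)
sanitize_tsv_field <- function(x) gsub("[\t\r\n]+", " ", x)

sanitized <- sanitize_tsv_field("Anchoveta raboamaril\t3")
print(sanitized)
```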
The remaining issues are terms related to non-taxa like environmental terms (e.g., wood) or functional groups (e.g., plankton). These do not have path/rank names. I've included the remaining issues below.
I've uploaded an updated copy of taxonCache for your review at https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .
This cleanup of taxonCache.tsv makes me re-realize the importance of data mobility, archiving, versioning, automated quality control, peer review and the effort this all takes...
id | name | rank | commonNames | path | pathIds | pathNames | externalUrl | thumbnailUrl |
---|---|---|---|---|---|---|---|---|
ENVO:00000339 | Stones | NA | NA | environmental feature \| mesoscopic physical object \| abiotic mesoscopic physical object \| piece of rock | ENVO:00002297 \| ENVO:00002004 \| ENVO:01000010 \| ENVO:00000339 | NA | http://purl.obolibrary.org/obo/ENVO_00000339 | NA |
ENVO:00001998 | soil | NA | NA | environmental material \| soil | ENVO:00010483 \| ENVO:00001998 | NA | http://purl.obolibrary.org/obo/ENVO_00001998 | NA |
ENVO:00002003 | bovine or equine dung | NA | NA | environmental material \| organic material \| bodily fluid \| excreta \| feces | ENVO:00010483 \| ENVO:01000155 \| ENVO:02000019 \| ENVO:02000022 \| ENVO:00002003 | NA | http://purl.obolibrary.org/obo/ENVO_00002003 | NA |
ENVO:00002007 | Sediment | NA | NA | environmental material \| sediment | ENVO:00010483 \| ENVO:00002007 | NA | http://purl.obolibrary.org/obo/ENVO_00002007 | NA |
ENVO:00002040 | Wood | NA | NA | environmental material \| organic material \| wood | ENVO:00010483 \| ENVO:01000155 \| ENVO:00002040 | NA | http://purl.obolibrary.org/obo/ENVO_00002040 | NA |
ENVO:01000155 | Detritus | NA | NA | environmental material \| organic material | ENVO:00010483 \| ENVO:01000155 | NA | http://purl.obolibrary.org/obo/ENVO_01000155 | NA |
ENVO:01000404 | plastic | NA | NA | environmental material \| anthropogenic environmental material | ENVO:00010483 \| ENVO:0010001 | NA | http://purl.obolibrary.org/obo/ENVO_01000404 | NA |
EOL:19662459 | Zooplankton | NA | NA | plankton \| zooplankton | NA | NA | http://eol.org/pages/19662459 | NA |
EOL:19662463 | Phytoplankton | NA | NA | plankton \| phytoplankton | NA | NA | http://eol.org/pages/19662463 | NA |
W:Bacterioplankton | bacterioplankton | NA | NA | plankton \| bacterioplankton | NA | NA | http://wikipedia.org/wiki/Bacterioplankton | NA |
W:Macroalgae | Macroalgae | NA | NA | algae \| macroalgae | NA | NA | http://wikipedia.org/wiki/Macroalgae | NA |
@jhpoelen Found some more rows with alignment / missing-id issue:
look for cases with whitespace in the id:
taxonCache %>% filter(grepl("\\s", id))
(Missed this one before because previously my pattern looked for identifiers with `"(:|-|_)"`, and some species names have these in them). I think it would actually be preferable if ids were all URIs -- would that be possible? e.g. there's what looks like some UUID strings in there but they don't have the `urn:uuid:` prefix, and some that seem to use `_` as a prefix?
Another possible issue I noticed in pathNames:
taxonCache %>% filter(grepl(":", pathNames))
This gets the above misaligned ones too, but looks like it is mostly getting pathNames given by identifiers, maybe mostly from Wikidata. I see why Wikidata does that so technically these aren't errors, but from a practical point of view it would be much better to have path names we can match to other path names, e.g. instead of `WD:Q35409 | ...` just have `family | ...` (as https://www.wikidata.org/wiki/Q35409). Or maybe that's an issue for a separate thread since it's not really about a parsing problem?
Thanks!
> taxonCache %>% filter(grepl("\\s", id))
Nice! This removes 41 remaining entries with misaligned columns. The accompanying entries with ids were also present in the taxonCache.
> I think it would actually be preferable if ids were all URIs -- would that be possible?
That would be possible, and can already be done using a prefix mapping like: https://api.globalbioticinteractions.org/prefixes . You might have noticed that externalUrl expands the id to a resolvable id when possible.
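As a rough illustration of prefix-based expansion, the base R sketch below maps prefixed ids to URLs. The prefix table is hand-written from id / externalUrl pairs quoted in this thread and is an assumption; it is not fetched from the prefixes endpoint, whose actual content and format may differ. `expand_id` is a hypothetical helper.

```r
# Assumed prefix table, inferred from id / externalUrl pairs quoted in this
# thread; the content served by the prefixes endpoint may differ.
prefixes <- c(
  "EOL:" = "http://eol.org/pages/",
  "INAT_TAXON:" = "http://inaturalist.org/taxa/"
)

# Hypothetical helper: expand prefixed ids to URLs, NA when no prefix matches.
expand_id <- function(id, prefixes) {
  out <- rep(NA_character_, length(id))
  for (p in names(prefixes)) {
    # Ids that are still unresolved, not missing, and start with this prefix.
    hit <- is.na(out) & !is.na(id) & startsWith(id, p)
    out[hit] <- paste0(prefixes[[p]], substring(id[hit], nchar(p) + 1L))
  }
  out
}

expanded <- expand_id(c("EOL:224784", "INAT_TAXON:379688", "NZOR-3-100527"), prefixes)
print(expanded)
```

Ids without a known prefix come back as `NA`, which makes unresolvable identifiers easy to list.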
> e.g. there's what looks like some UUID strings in there but they don't have the urn:uuid: prefix, and some that seem to use _ as a prefix?
Good point. Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.
taxonCache %>% filter(grepl(":", pathNames))
This additional validator only selected the wikidata path names. As you noticed, abbreviated wikidata identifiers were used to capture the rank information. This was done for pragmatic reasons. It should be relatively easy to map the rank name ids to associated labels. In the future, we might want to introduce a normalized term rank by introducing rankName and rankId, in addition to pathNames and pathNameIds. Related to #7 .
I've prepared https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz for your review. If you are ok with this version, I'll prepare another zenodo publication. Otherwise, please detail your concerns.
> Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.
Sounds like a plan. Nice to have ALA taxon addressed. I'm still seeing 57 rows that don't have a `:` in the `id`, e.g.
> taxonCache %>% filter(!grepl(":", id))
# A tibble: 57 x 9
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 4701dc84-660a-4c51-bd16-593997f2370b Coelo… spec… NA Fung… urn:ls… kingdom … NA NA
2 ALA_Cladia_muelleri Cladi… unkn… NA | Cl… | ALA_… | unknown NA NA
3 ALA_Delia_hirticrura Delia… unkn… NA | De… | ALA_… | unknown NA NA
4 ALA_Oxycetonia_jucunda Oxyce… unkn… NA | Ox… | ALA_… | unknown NA NA
5 NZOR-3-100527 Proci… genus NA | Pr… | NZOR… | genus NA NA
6 NZOR-3-109825 Marie… genus NA | Ma… | NZOR… | genus NA NA
7 NZOR-3-33834 Misce… unkn… NA | Mi… | NZOR… | unknown NA NA
8 NZOR-3-40069 Proka… unkn… NA | Pr… | NZOR… | unknown NA NA
9 NZOR-3-41136 Urtic… genus NA | Ur… | NZOR… | genus NA NA
10 NZOR-3-54695 Oreoc… genus NA | Or… | NZOR… | genus NA NA
# ... with 47 more rows
Maybe that is intentional? It isn't clear if these identifiers can be resolved; notably they have no externalUrl entry, though `ALA` and `NZOR` look like they want to be prefixes to something(?)
There's a larger set of things with no externalUrl, some of which seem to have prefixes that aren't defined in the prefix table (`CoL`, `CAAB`, ...), e.g.:
> taxonCache %>% filter(is.na(externalUrl))
# A tibble: 2,770 x 9
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 4701dc8… Coelom… spec… NA Fungi | Chy… urn:lsid:indexfungorum.… kingdom | p… NA NA
2 ALA_Cla… Cladia… unkn… NA | Cladia mu… | ALA_Cladia_muelleri | unknown NA NA
3 ALA_Del… Delia … unkn… NA | Delia hir… | ALA_Delia_hirticrura | unknown NA NA
4 ALA_Oxy… Oxycet… unkn… NA | Oxycetoni… | ALA_Oxycetonia_jucunda | unknown NA NA
5 CAAB:0c… Halica… spec… NA Halicarcinu… CAAB:0cd18290:475549ca:… species NA NA
6 CAAB:23… Taloch… spec… NA | Talochlam… | CAAB:23270067 | species NA NA
7 CAAB:28… Crab z… unkn… NA | Crab zoea | CAAB:28850902 | unknown NA NA
8 CAAB:53… Mastog… spec… NA Mastogloiac… CAAB:53210000 | CAAB:53… family | ge… NA NA
9 CAAB:80… Microa… unkn… microalgae… | Microalgae | CAAB:80200000 | unknown NA NA
10 CoL:254… Pseudo… spec… NA Pseudoparre… CoL:25759155 | CoL:2549… genus | spe… NA NA
# ... with 2,760 more rows
Again, I think this all just shows what an amazing resource this is to have all of this compiled in a nice file like `taxonCache.tsv.gz`, as synthesizing all these resources in a single table like that is far from trivial!
Running a few experiments on the pipe paths but I think that all relates to next steps in #7 rather than possible issues in `taxonCache`. Lemme know what you think about the above concerns with some of the ids, but otherwise this is looking ready for release to me.
Looks like there might be a few cases where path, pathNames, and pathIds do not all have the same length (not counting cases where any one of these is NA), e.g. the row with id = `ITIS:10824`. Could be indicative of an issue?
In case it's at all helpful, here's the crummy R code I'm using to identify the ~1000 rows that appear to have issues.
## Expect same number of pipes in each entry:
pattern = "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
pathIds_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathIds, pattern)[[1]]))
na_path <- is.na(taxonCache$path)
na_pathNames <- is.na(taxonCache$pathNames)
na_pathIds <- is.na(taxonCache$pathIds)
trouble <- which( !(pathIds_pipes == path_pipes) & !na_path & !na_pathIds)
## Here's the ~1000 rows that appear miss-matched to me
taxonCache[trouble,]
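A vectorized variant of the same check, sketched in base R, counts pipe characters instead of splitting row by row and compares all three columns at once. The toy rows below use made-up ids; the second one mirrors the path / pathNames mismatch reported for row 243356. `n_elements` is a hypothetical helper.

```r
# Number of pipe-delimited elements per entry; NA stays NA.
n_elements <- function(x) nchar(gsub("[^|]", "", x)) + 1L

# Toy rows (made-up ids); the second mirrors the path / pathNames mismatch
# reported for row 243356.
toy <- data.frame(
  id = c("X:1", "X:2", "X:3"),
  path = c("Plantae | Gnaphalium", "Gnaphalium purpureum", NA),
  pathIds = c("X:9 | X:1", "X:2", NA),
  pathNames = c("kingdom | genus", "kingdom | species", NA),
  stringsAsFactors = FALSE
)

# One column of counts per pipe-string column.
n <- sapply(toy[c("path", "pathIds", "pathNames")], n_elements)
# Skip rows where any of the three values is NA.
complete <- !is.na(rowSums(n))
mismatch <- which(complete & (n[, "path"] != n[, "pathIds"] |
                              n[, "path"] != n[, "pathNames"]))
toy[mismatch, ]
```

Counting delimiters avoids per-row list handling, which may matter on a table of this size.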
Very helpful indeed, thanks for being thorough. I am working on an input / output validation framework to more easily detect these inconsistencies (#8). Curious to hear your thoughts on that.
@cboettig just published http://doi.org/10.5281/zenodo.1250572 . In this version, the consistency of terms and links was checked using nomer's `validate-term` and `validate-term-link`. Also, various fixes were included to help make the ids and their hierarchies a bit more well-behaved.
@jhpoelen Maybe I'm not understanding something here, but it seems there's ~ 500,000 rows in taxonCache involving duplicate ids?
I think this should be reproducible R code:
library(tidyverse)
taxonCache <- read_tsv("https://zenodo.org/record/1250572/files/taxonCache.tsv.gz", quote="")
dup_id <-
taxonCache %>% select(id) %>% group_by(id) %>%
summarise(n_id = length(id)) %>% filter(n_id > 1)
trouble <- taxonCache %>% semi_join(select(dup_id, id))
# a data frame with the subset of taxonCache having duplicate ids
trouble
This prevents me from establishing a unique path / pathId / pathNames for an ID; it's not clear how to resolve the conflicts. I think this is related to (/the cause of) the issue I just added to #7.
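To separate harmless exact repeats from ids that carry genuinely different hierarchies, one option is to drop identical rows first and only then look for repeated ids. A base R sketch on made-up toy data (the `X:` ids and paths are placeholders):

```r
# Toy data (made-up ids): X:1 is repeated verbatim, X:2 has two different paths.
toy <- data.frame(
  id = c("X:1", "X:1", "X:2", "X:2", "X:3"),
  path = c("A | B", "A | B", "A | C", "A | D | C", "E"),
  stringsAsFactors = FALSE
)

# Drop exact repeats first, then look for ids that still occur more than once.
distinct_rows <- unique(toy[, c("id", "path")])
conflicting <- unique(distinct_rows$id[duplicated(distinct_rows$id)])
print(conflicting)
```

On the real table the same comparison would presumably include pathIds and pathNames as well.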
@cboettig thanks for sharing. See https://github.com/globalbioticinteractions/nomer/issues/7#issuecomment-395992615 . I think this warrants a further discussion. . .
Also, please note https://github.com/globalbioticinteractions/nomer/issues/9 - would having the name source / retrieval date provide more information on which taxon id to select?
Currently, GloBI itself uses a pretty blunt method - just use all that match to populate taxon search index/ graph.
Here's an example of a taxon id with slight changes in name hierarchies as provided by the name source. Note that http://id.biodiversity.org.au/node/apni/50587232 and https://id.biodiversity.org.au/taxon/apni/51337710 are both outdated identifiers for Plantae. So, this is an example of multiple interpretations of taxon ids.
Am leaving this issue open because it exposes some interesting effects associated to taxon ids.
Hi @jhpoelen,
I'm running into some issues parsing the `taxonCache` file in the Zenodo-archived data http://doi.org/10.5281/zenodo.1213465 (which looks super nice otherwise btw). For instance, the `readr` package in R shows a few parsing errors, mostly due to what might be extraneous quote characters.
Those are pretty minor though, looks like only 3 rows are having issues. More troublesome is that somehow `readr` parsing of the file is getting some rows misaligned, e.g. you get a whole sequence of rows where the `path` column has `pathId` values. A quick inspection of these rows shows they are all shifted over by one column, as they are all missing the first column (an `id`). (The same problem can be reproduced with the base R `read.delim`, which is much slower than the `readr` implementation.) Is there something that can be done so those rows that don't have an id still begin with a proper delimiter such that they get an `NA` for `id` instead of causing this misalignment?
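Rows that are short by one column can also be spotted on the raw text, before any parser gets involved, by counting tab characters per line. A base R sketch on made-up lines (in practice they could come from `readLines()`):

```r
# Toy lines (made-up values); the last one lacks its id column and has no
# leading tab to mark the empty field.
lines <- c(
  "id\tname\trank",
  "X:1\tNeoniphon sammara\tSpecies",
  "Neoniphon\tGenus"
)

# Fields per line = number of tabs + 1.
n_fields <- nchar(gsub("[^\t]", "", lines)) + 1L
# Lines whose field count differs from the header line.
short <- which(n_fields != n_fields[1])
print(short)
```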