Open Jegelewicz opened 3 years ago
I'll try to address some items here:
- convert to DarwinCore terms - looks like we have two ways of doing this? # convert to DarwinCore terms
There's only one function to perform conversion, unless I'm missing something? But, this code only addresses some DarwinCore terms (only the ones needed in the input files we've worked with so far). We need to decide which terms we want to support.
- siphonaptera dataset: remove '\xa0' chars from relevant fields - shouldn't we do this for all names?
Yes, but I wasn't sure how to iterate over all possible columns in a short line of code in R and wrote this quickly. Please fix!
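One way to cover every column at once, sketched in Python rather than the project's R. This assumes the table is held as a list of dicts, one per row; `strip_nbsp` and the sample row are hypothetical names for illustration only.

```python
NBSP = "\xa0"  # non-breaking space

def strip_nbsp(records, replacement=""):
    """Return copies of the rows with NBSP removed from every string field.

    Non-string values (numbers, None) are passed through unchanged.
    Pass replacement=" " if the NBSP should become a normal space instead.
    """
    cleaned = []
    for row in records:
        cleaned.append({
            key: value.replace(NBSP, replacement) if isinstance(value, str) else value
            for key, value in row.items()
        })
    return cleaned

# Hypothetical sample row, not taken from the siphonaptera dataset.
sample_rows = [{"genus": "Pulex\xa0", "specificEpithet": "irritans", "year": 1758}]
cleaned_rows = strip_nbsp(sample_rows)
```

Because the function loops over whatever keys each row has, no column names need to be listed by hand.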
- test for names containing punctuation - Should this happen given that we have stripped punctuation?
We did not strip any punctuation, preferring instead to flag and separate punctuation-containing names. We wrote a function to strip it, but we do not currently use it anywhere.
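The flag-and-separate approach could look roughly like this. It is a Python sketch, not the R code in the script; `split_by_punctuation` is a hypothetical helper, and it only checks ASCII punctuation.

```python
import string

# ASCII punctuation only; typographic quotes or dashes would need adding.
PUNCTUATION = set(string.punctuation)

def split_by_punctuation(names):
    """Separate names into (clean, flagged) without altering any of them."""
    clean, flagged = [], []
    for name in names:
        if any(ch in PUNCTUATION for ch in name):
            flagged.append(name)
        else:
            clean.append(name)
    return clean, flagged

clean_names, flagged_names = split_by_punctuation(
    ["Ixodes ricinus", "Ixodes (Ixodes) ricinus", "Ixodes sp."]
)
```

Nothing is stripped here, so the flagged names keep their original form for later review.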
- if not add that genus back into single_epithet with 'sp' for the epithet - why add sp when it was stripped earlier?
The only reason is to ensure every genus has at least one child (because a genus, by definition, must contain at least one species). But in this case, we don't know what the species name(s) might be, thus 'sp'. Also, I wasn't sure what would happen during the interaction with Taxotools downstream if a genus did not have any species represented. The canonical name (without sp.) would be uninomial, so that's a problem to deal with. Perhaps Taxotools can handle this ok. We should discuss.
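To make the intent concrete, here is a minimal Python sketch of the "every genus gets at least one child" step. The function name, the list-of-dicts layout, and the example genera are assumptions for illustration; the real script is in R and may organise this differently.

```python
def add_placeholder_species(genera, species_rows, placeholder="sp"):
    """Append a '<genus> sp' row for each genus that has no species row yet."""
    genera_with_children = {row["genus"] for row in species_rows}
    result = list(species_rows)
    for genus in genera:
        if genus not in genera_with_children:
            result.append({"genus": genus, "specificEpithet": placeholder})
            genera_with_children.add(genus)  # avoid duplicates if a genus repeats
    return result

padded_rows = add_placeholder_species(
    ["Ixodes", "Amblyomma", "Amblyomma"],
    [{"genus": "Ixodes", "specificEpithet": "ricinus"}],
)
```

Genera that already have a species are left alone, so the placeholder only appears where the child would otherwise be missing.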
(10.) fix capitalization for both genus and species - all terms?
Yeah, I think we can expand this.
(19.) remove sp's - also ssp's?
Easy enough. Any other possibilities?
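A sketch of how the placeholder list could be widened, in Python for illustration. Two assumptions: that "remove" means blanking the epithet rather than dropping the row, and that variants such as "spp." and "subsp." are wanted; only sp and ssp are actually mentioned in this thread.

```python
# Assumed set of placeholder epithets; extend or trim as needed.
PLACEHOLDER_EPITHETS = {"sp", "sp.", "spp", "spp.", "ssp", "ssp.", "subsp", "subsp."}

def blank_placeholder_epithets(row, fields=("specificEpithet", "infraspecificEpithet")):
    """Return a copy of the row with placeholder epithets replaced by ''."""
    cleaned = dict(row)
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDER_EPITHETS:
            cleaned[field] = ""
    return cleaned

blanked = blank_placeholder_epithets(
    {"genus": "Ixodes", "specificEpithet": "Sp.", "infraspecificEpithet": "ssp"}
)
```

Matching on the whole, lower-cased value means a real epithet that merely starts with "sp" is not touched.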
(25.) remove any empty strings as genera - I think this means that genus is NULL but there is a term in either specificEpithet or infraspecificEpithet?
Can mean genus is '' (empty string != NULL) and that either or both of those other fields are populated or also ''.
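The empty-string versus NULL distinction can be handled in one check. This is a Python sketch with hypothetical helper names; treating whitespace-only values as missing is an extra assumption, not something the current script is known to do.

```python
def is_missing(value):
    """True for None (NULL) and for empty or whitespace-only strings."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def split_on_missing_genus(records):
    """Separate rows into (kept, dropped) based on whether genus is missing.

    The epithet fields are ignored: a row with no genus is dropped
    whether or not specificEpithet or infraspecificEpithet is populated.
    """
    kept, dropped = [], []
    for row in records:
        (dropped if is_missing(row.get("genus")) else kept).append(row)
    return kept, dropped

kept_rows, dropped_rows = split_on_missing_genus([
    {"genus": "Ixodes", "specificEpithet": "ricinus"},
    {"genus": "", "specificEpithet": "ricinus"},
    {"genus": None, "specificEpithet": ""},
])
```

Returning the dropped rows instead of discarding them keeps the information available for review.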
- generate scientificNameAuthorship - note, that for animals this should include author and year...
We can handle that. Authors and years are still a WIP - see line 231. Can you link to the source for that "note, that for animals this should include author and year..."? I wasn't aware of that. Didn't see that on the DarwinCore page. The examples don't all include year, but maybe they are plant examples. Weird to give animals and plants a different set of formatting rules for the same DarwinCore term!
I'm new to collaborative github'ing, but I think if you want to rearrange the order of the code you would create a new branch, change the code on that branch, test it, and then submit a pull request; if it works equally well or better, we could merge those changes back into the main branch. Maybe we should consider some test-driven development to evaluate this. This is something I am doing in Python to ensure edits produce the expected results, without manually checking inputs and outputs every time the code changes. But I am not sure how to do that in R. Maybe Vijay is familiar?
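For reference, this is roughly what the test-driven approach looks like on the Python side. `fix_capitalization` is a hypothetical stand-in for one cleaning step (not the function in our script), and the tests are plain functions that a runner such as pytest would collect. In R, the testthat package plays a similar role.

```python
def fix_capitalization(row):
    """Capitalize the genus and lower-case the epithets; leave other values alone."""
    cleaned = dict(row)
    if isinstance(cleaned.get("genus"), str):
        cleaned["genus"] = cleaned["genus"].strip().capitalize()
    for field in ("specificEpithet", "infraspecificEpithet"):
        if isinstance(cleaned.get(field), str):
            cleaned[field] = cleaned[field].strip().lower()
    return cleaned

# Each test pins down one expected behaviour, so a later edit that
# breaks it fails immediately instead of needing a manual check.
def test_genus_is_capitalized():
    assert fix_capitalization({"genus": "iXODES"})["genus"] == "Ixodes"

def test_epithets_are_lowercased():
    row = fix_capitalization({"genus": "Ixodes", "specificEpithet": "Ricinus"})
    assert row["specificEpithet"] == "ricinus"

def test_missing_fields_are_left_alone():
    assert fix_capitalization({"genus": None}) == {"genus": None}
```

Reordering the cleaning steps on a branch would then just mean rerunning the tests and checking that they still pass.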
Weird to give animals and plants a different set of formatting rules for the same DarwinCore term!
The format is set based upon the nomenclatural code and plant and animal people definitely do things differently...ICZN vs ICBN. Of course, taxonomists appear loath to follow codes, so there are also "conventions" in almost every discipline. UGH
Right, but I couldn't find anywhere in ICZN that specified how to construct a "scientificNameAuthorship"
Why does that not surprise me? But I know there are some kind of rules - like adding parentheses when the name gets changed due to nomenclatural rules. Right?
I couldn't find anywhere in ICZN that specified how to construct a "scientificNameAuthorship"
https://www.iczn.org/the-code/the-international-code-of-zoological-nomenclature/the-code-online/
Enjoy
I looked through that yesterday briefly and didn't see anything. I looked again today and I still don't see any specific rules about how database-related terms should be formed. This doesn't surprise me, since the ICZN usually doesn't produce comprehensive rules, but only the minimum needed to keep taxonomy functioning (see recent controversy over what is allowed to constitute a type specimen). So, I'm still left wondering how Darwin Core formed the rule that ICZN names are Author, Year and ICBN are only Author.
how database-related terms should be formed.
It does not matter if it is in a database - the terms should be formed as indicated in the code for any purpose.
51.1. Optional use of names of authors
The name of the author does not form part of the name of a taxon and its citation is optional, although customary and often advisable.
Recommendation 51A. Citation of author and date. The original author and date of a name should be cited at least once in each work dealing with the taxon denoted by that name. This is especially important in distinguishing between homonyms and in identifying species-group names which are not in their original combinations. If the surname and forename(s) of an author are liable to be confused, these should be distinguished as in scientific bibliographies.
Recommendation 51B. Transliteration of author's name. When the author's name is customarily written in a language that does not use the Latin alphabet it should be given in Latin letters with or without diacritic marks.
51.2. Form of citation of authorship
The name of an author follows the name of the taxon without any intervening mark of punctuation, except in changed combinations as provided in Article 51.3.
Recommendation 51C. Citation of multiple authors. When three or more joint authors have been responsible for a name, then the citation of the name of the authors may be expressed by use of the term "et al." following the name of the first author, provided that all authors of the name are cited in full elsewhere in the same work, either in the text or in a bibliographic reference.
51.2.1. The name of a subsequent user, if cited, is to be separated from the name of the taxon in some distinctive and explicit manner, but not by parentheses (cf. Article 51.3), unless an explanation is included.
Example. Reference to Cancer pagurus Linnaeus as used by Latreille may be cited as "Cancer pagurus Linnaeus sensu Latreille", or as "Cancer pagurus Linnaeus (as interpreted by Latreille)" or in some other distinctive manner, but not as "Cancer pagurus Latreille" or "Cancer pagurus (Latreille)".
Recommendation 51D. Author anonymous, or anonymous but known or inferred. If the name of a taxon was (or is deemed to have been) established anonymously, the term "Anon." may be used as though it was the name of the authors. However, if the authorship is known or inferred from external evidence, the name of the author, if cited, should be enclosed in square brackets to show the original anonymity. For availability of names proposed anonymously see Article 14.
Recommendation 51E. Citation of contributors. If a scientific name and the conditions other than publication that make it available [Arts. 10 to 20] are the responsibility not of the author of the work containing them, but of some other person(s), or of less than all of joint authors, the authorship of the name, if cited, should be stated as "B in A", or "B in A & B", or in whatever form is appropriate to facilitate information retrieval (normally the date should also be cited).
Recommendation 51F. Citation of author of unavailable or excluded names. If citation of authorship for an unavailable or excluded name [Rec. 50C] is necessary or desirable, the nomenclatural status of the name should be made evident.
Examples. Halmaturus rutilis Lichtenstein, 1818 (nomen nudum); Yerboa gigantea Zimmermann, 1777 (published in a work rejected by the Commission in Opinion 257); "Pseudosquille" (a vernacular name published by Eydoux & Souleyet (1842)).
51.3. Use of parentheses around authors' names (and dates) in changed combinations
When a species-group name is combined with a generic name other than the original one, the name of the author of the species-group name, if cited, is to be enclosed in parentheses (the date, if cited, is to be enclosed within the same parentheses).
Example. Taenia diminuta Rudolphi, when transferred to the genus Hymenolepis, is cited as Hymenolepis diminuta (Rudolphi) or Hymenolepis diminuta (Rudolphi, 1819).
51.3.1. Parentheses are not used when the species-group name was originally combined with an incorrect spelling or an emendation of the generic name (this applies even though an unjustified emendation is an available name with its own authorship and date [Art. 33.2.3]).
Example. The species-group name subantiqua d'Orbigny, 1850 was established in combination with Fenestrella, d'Orbigny's incorrect spelling of Fenestella Lonsdale, 1839. The species is cited as Fenestella subantiqua d'Orbigny, 1850, and not as Fenestella subantiqua (d'Orbigny, 1850).
51.3.2. The use of parentheses enclosing the name of the author and the date is not affected by the presence of a subgeneric name, by transfer to a different subgenus within the same genus, by a change of rank within the species group, or by transfer of a subspecies to a different species within the same genus.
Example. Goniocidaris florigena Agassiz, when transferred to the genus Petalocidaris, is cited as Petalocidaris florigena (Agassiz). When Petalocidaris is treated as a subgenus of Goniocidaris the parentheses are omitted, even when the complete citation is given as Goniocidaris (Petalocidaris) florigena Agassiz.
51.3.3. If before 1961 a new species-group name was established in combination with a previously available genus-group name and, at the same time, the author conditionally proposed a new nominal genus for it, parentheses are not used with the author's name when the species-group name is used in combination with the previously established generic name, but are used when the species-group name is combined with the conditionally proposed generic name (see Article 11.9.3.6).
Example. Lowe (1843) established the new fish species Seriola gracilis and at the same time conditionally proposed a new genus Cubiceps to contain that nominal species. When included in Cubiceps, the name is cited as Cubiceps gracilis (Lowe, 1843).
Recommendation 51G. Citation of person making new combination. If it is desired to cite both the author of a species-group nominal taxon and the person who first transferred it to another genus, the name of the person forming the new combination should follow the parentheses that enclose the name of the author of the species-group name (and the date, if cited; see Recommendation 22A.3).
Examples. Limnatis nilotica (Savigny) Moquin-Tandon; Methiolopsis geniculata (Stål, 1878) Rehn, 1957.
I think this is the pertinent remark:
Recommendation 51A. Citation of author and date. The original author and date of a name should be cited at least once in each work dealing with the taxon denoted by that name.
Including the name in a database is including it in a "work dealing with the taxon denoted by that name", so it should be cited appropriately at least once - we would do that via scientificNameAuthorship following all of the rules above. (Assuming we can figure them out...)
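As a starting point, the core of Article 51.3 can be sketched in a few lines of Python. `format_authorship` and its parameters are hypothetical, the "Author, Year" comma follows the examples quoted above, and the caller is assumed to already know whether the combination has changed; the exceptions in 51.3.1 to 51.3.3 are not handled.

```python
def format_authorship(author, year=None, changed_combination=False):
    """Build an authorship string in the style of the ICZN examples.

    author: author name(s) as they should appear, e.g. "Rudolphi"
    year: optional date; citing it is optional under Art. 51.3
    changed_combination: True when the species-group name is now combined
        with a genus other than the original one (Art. 51.3)
    """
    citation = author if year is None else f"{author}, {year}"
    return f"({citation})" if changed_combination else citation

original = format_authorship("d'Orbigny", 1850)
transferred = format_authorship("Rudolphi", 1819, changed_combination=True)
transferred_no_date = format_authorship("Rudolphi", changed_combination=True)
```

With the inputs above this reproduces the forms used in the Code's own examples: d'Orbigny, 1850 without parentheses, and (Rudolphi, 1819) or (Rudolphi) for the transferred name.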
Well, then perhaps my issue is with ICBN. Are you telling me the ICBN doesn't rule that years should be given on citations of taxonomic names??
Nope - I have no idea what ICBN says. Note that even ICZN uses author only in its examples even though it explicitly states "original author and date", which implies a date, not a year....
Here is ICBN - https://www.iapt-taxon.org/icbn/main.htm
See article 46
46.1. In publications, particularly those dealing with taxonomy and nomenclature, it may be desirable, even when no bibliographic reference to the protologue is made, to cite the author(s) of the name concerned (see Art. 6 Note 2; see also Art. 22.1 and 26.1). In so doing, the following rules are to be followed. Ex. 1. Rosaceae Juss., Rosa L., Rosa gallica L., Rosa gallica var. eriostyla R. Keller, Rosa gallica L. var. gallica.
I don't know who writes this stuff, but it is as clear as mud.
I hate that so much. Also, 51A is just 'recommendation', so not actually a rule... gah! whyyyy
Because - taxonomy....
Taxonomy really needs to step into the 20th century, yes, I said 20th....
And we need rules dammit! All of these "recommendations" just make for bad science.
@njdowdy I have done a lot of stuff to my branch of the ixodes code, but I am not finished. Hopefully by the time we meet up on Monday afternoon, I'll be done and we can discuss.
Great! Looking forward to it.
I think we need to reorder some of the steps in this script so that we aren't losing information along the way. Taking the comments from the code, current order is (I have some comments in bold):
Sort of using the above, here is the order that I propose for data cleaning:
Review and return outliers or place them in the "Expert Review" batch
Generate missing data
OK - this is a work in process, I have to go eat lunch....we can discuss this afternoon.