njdowdy / ixodes-tpt

Tick taxonomy for TPT project
GNU General Public License v3.0

Order of data cleaning steps #1

Open Jegelewicz opened 3 years ago

Jegelewicz commented 3 years ago

I think we need to reorder some of the steps in this script so that we aren't losing information along the way. Taking the comments from the code, current order is (I have some comments in bold):

  1. import taxoworks library
  2. load data
  3. number of starting records for verification
  4. lower case column names
  5. remove columns that do not relate to taxonomy
  6. convert to DarwinCore terms **looks like we have two ways of doing this?** # convert to DarwinCore terms
  7. basic string cleaning functions
  8. remove punctuation (but not spaces)
  9. siphonaptera dataset: remove '\xa0' chars from relevant fields **shouldn't we do this for all names?**
  10. fix capitalization for both genus and species
  11. select single-word specific_epithets
  12. no-name species OR genus
  13. single-name species AND genus
  14. multi-name species OR genus
  15. multi-name subspecies
  16. single subspecific name OR no subspecies
  17. strip spaces from ends of strings
  18. test for names containing punctuation **Should this happen given that we have stripped punctuation?**
  19. remove sp's
  20. remove very short names for manual verification
  21. very short specific_epithet OR genus
  22. insert some code to check that all "incomplete_epithet" higher taxonomy is present in "single_epithet"
  23. if not add that genus back into single_epithet with 'sp' for the epithet - **why add sp when it was stripped earlier?**
  24. !(unique(incomplete_epithet$genus) %in% unique(single_epithet$genus))
  25. remove any empty strings as genera
  26. why doesn't unique(single_epithet$genus) work?
  27. set subspecies to 'sp' **why add sp when it was stripped earlier?**
  28. single_epithet <- rbind(single_epithet, missing_genera) # not sure if we want to do this without checking first
  29. if the above line is approved, we may need to adjust the 'verification passed' check below
  30. combine all rows requiring expert review
  31. for additional parsing
  32. successfully parsed
  33. verify no records were lost
  34. generate canonical name
  35. check Levenshtein's Distance (e.g., misspellings) [may need to do before canonical name generation] Watch for: Ornithodoros vunkeri; Ornithodoros yukeri; Ornithodoros yunkeri
  36. import stringdist library
  37. check for duplicate names
  38. deduplicated list
  39. synonymize subspecies example: Amblyomma triguttatum triguttatum = Amblyomma triguttatum
  40. parsed$genus <- array(as.character(unlist((parsed$genus)))) # sometimes needed to sort by variable in RStudio
  41. parsed$species <- array(as.character(unlist((parsed$species)))) # sometimes needed to sort by variable in RStudio
  42. number unique
  43. handle incomplete_epithet
  44. handle multi-word names
  45. handle authors, years
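The "count the starting records, split them into batches, then verify no records were lost" pattern in this list can be sketched in base R. This is only an illustration: the data frame `raw`, its columns, and the review rule are made up here and are not taken from the script.

```r
# Hypothetical input; the real script loads its data from a file.
raw <- data.frame(
  genus   = c("Ixodes", "Amblyomma", "", "Ornithodoros"),
  species = c("ricinus", "triguttatum", "scapularis", "sp"),
  stringsAsFactors = FALSE
)

starting_records <- nrow(raw)

# Assumed rule: a row needs review when the genus is missing or empty,
# or when the epithet is only a placeholder.
needs_review <- is.na(raw$genus) | raw$genus == "" |
  raw$species %in% c("sp", "sp.")

parsed        <- raw[!needs_review, ]
expert_review <- raw[needs_review, ]

# Every starting row must end up in exactly one of the two batches.
verification_passed <- nrow(parsed) + nrow(expert_review) == starting_records
if (!verification_passed) stop("records were lost or duplicated during cleaning")
```

Because both batches come from one logical vector and its negation, the check can only fail if a later step adds or drops rows, which is the case it is meant to catch.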

Sort of using the above, here is the order that I propose for data cleaning:

  1. (1.) import taxoworks library
  2. (36.) import stringdist library
  3. (2.) load data
  4. (3.) number of starting records for verification
  5. (42.) number unique

Clean up columns/headers

  6. (4.) lower case column names
  7. (5.) remove columns that do not relate to taxonomy
  8. (6.) convert to DarwinCore terms **looks like we have two ways of doing this?** # convert to DarwinCore terms

Clean up text

  9. (7.) basic string cleaning functions
  10. (8.) remove punctuation (but not spaces)
  11. (9.) remove '\xa0' chars from relevant fields **shouldn't we do this for all names?**
  12. (10.) fix capitalization for both genus and species **all terms?**
  13. (17.) strip spaces from ends of strings
  14. (19.) remove sp's **also ssp's?**

Set aside outliers for further review

  15. (20.) remove very short names for manual verification
  16. (21.) very short specific_epithet OR genus
  17. (25.) remove any empty strings as genera **I think this means that genus is NULL but there is a term in either specificEpithet or infraspecificEpithet?**
  18. (43.) handle incomplete_epithet
  19. (44.) handle multi-word names

Review and return outliers or place them in the "Expert Review" batch

Generate missing data

  1. (34.) generate canonical name
  2. generate taxonRank
  3. generate scientificName
  4. generate scientificNameAuthorship - note, that for animals this should include author and year...
  5. generate parentNameUsage

Remove Duplicates

  6. (37.) check for duplicate names **remove**
  7. (38.) deduplicated list

Generate missing names - this will create "new" rows that will need numbering

  8. (39.) synonymize subspecies example: Amblyomma triguttatum triguttatum = Amblyomma triguttatum
  9. generate specificEpithet if not present for infraspecificEpithet
  10. generate genus if not present for specificEpithet
  11. generate subgenus if needed
  12. generate all higher taxa if not present

Look for potential misspellings/duplicates

  13. (35.) check Levenshtein's Distance (e.g., misspellings) [may need to do before canonical name generation] # Watch for: Ornithodoros vunkeri; Ornithodoros yukeri; Ornithodoros yunkeri
  14. Remove any similar names that cannot be easily determined as appropriate for expert review.
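For the misspelling check, here is a minimal sketch of a pairwise Levenshtein comparison. The script imports stringdist; this sketch swaps in base R's `adist()` so it runs without extra packages. The threshold of 2 is an assumption, and the "Ixodes ricinus" entry is added only as a contrast.

```r
# Names from the "Watch for" note, plus one clearly different name.
names_to_check <- c(
  "Ornithodoros vunkeri",
  "Ornithodoros yukeri",
  "Ornithodoros yunkeri",
  "Ixodes ricinus"
)

# Pairwise Levenshtein distances; adist() lives in the base 'utils' package.
d <- adist(names_to_check)

# Keep each pair once (upper triangle) when the distance is small but non-zero.
max_distance <- 2  # assumed threshold, tune for the data
close <- which(d > 0 & d <= max_distance & upper.tri(d), arr.ind = TRUE)

similar_pairs <- data.frame(
  name_a   = names_to_check[close[, "row"]],
  name_b   = names_to_check[close[, "col"]],
  distance = d[close],
  stringsAsFactors = FALSE
)
```

With these inputs the three Ornithodoros spellings pair up with each other and none pairs with Ixodes ricinus. The flagged pairs would then go to a reviewer rather than being merged automatically.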

OK - this is a work in progress, I have to go eat lunch....we can discuss this afternoon.

njdowdy commented 3 years ago

I'll try to address some items here:

  1. convert to DarwinCore terms looks like we have two ways of doing this?# convert to DarwinCore terms

There's only one function to perform conversion, unless I'm missing something? But, this code only addresses some DarwinCore terms (only the ones needed in the input files we've worked with so far). We need to decide which terms we want to support.

  1. siphonaptera dataset: remove '\xa0' chars from relevant fields shouldn't we do this for all names?

Yes, but I wasn't sure how to iterate over all possible columns in a short line of code in R and wrote this quickly. Please fix!
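One possible way to cover every text column at once is sketched below in base R. The helper name `strip_nbsp` and the example data are hypothetical. It assumes the text is UTF-8, where '\xa0' is the non-breaking space U+00A0.

```r
# Remove non-breaking spaces (U+00A0) from every character column of a data frame.
strip_nbsp <- function(df) {
  is_text <- vapply(df, is.character, logical(1))
  df[is_text] <- lapply(df[is_text], function(x) gsub("\u00a0", "", x, fixed = TRUE))
  df
}

# Hypothetical example data
fleas <- data.frame(
  genus   = c("Pulex\u00a0", "Ctenocephalides"),
  species = c("irritans", "\u00a0felis"),
  n       = c(1L, 2L),
  stringsAsFactors = FALSE
)

cleaned <- strip_nbsp(fleas)
```

Non-character columns such as `n` are left untouched. If a non-breaking space can occur between two words, replacing it with a regular space instead of "" may be the safer choice.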

  1. test for names containing punctuation Should this happen given that we have stripped punctuation?

We did not strip any punctuation, preferring instead to flag and separate punctuation-containing names. We wrote a function to strip it, but we do not currently use it anywhere.
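A rough sketch of the flag-and-separate approach, as opposed to stripping, is below. The column names and example rows are made up for illustration.

```r
taxa <- data.frame(
  genus   = c("Ixodes", "Amblyomma", "Ornithodoros"),
  species = c("ricinus", "sp.", "(Alectorobius) yunkeri"),
  stringsAsFactors = FALSE
)

# TRUE for rows where any of the checked fields contains a punctuation character.
name_fields <- c("genus", "species")
has_punct <- Reduce(`|`, lapply(taxa[name_fields], function(x) grepl("[[:punct:]]", x)))

contains_punctuation <- taxa[has_punct, ]   # set aside for review, names untouched
no_punctuation       <- taxa[!has_punct, ]
```

Nothing is altered in either batch, so a reviewer still sees the original punctuation when deciding how to fix a name.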

  1. if not add that genus back into single_epithet with 'sp' for the epithet - why add sp when it was stripped earlier?

The only reason is to ensure every genus has at least one child (because a genus, by definition, must contain at least one species). But in this case, we don't know what the species name(s) might be, thus 'sp'. Also, I wasn't sure what would happen during the interaction with Taxotools downstream if a genus did not have any species represented. The canonical name (without sp.) would be uninomial, so that's a problem to deal with. Perhaps Taxotools can handle this ok. We should discuss.
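To make the idea concrete, here is a small sketch of finding genera that appear only in `incomplete_epithet` and adding an 'sp' placeholder row for each. The two data frames are invented examples that reuse the names from the script comments. The result goes into a new variable so the original `single_epithet` can still be checked first.

```r
single_epithet <- data.frame(
  genus   = c("Ixodes", "Amblyomma"),
  species = c("ricinus", "triguttatum"),
  stringsAsFactors = FALSE
)
incomplete_epithet <- data.frame(
  genus   = c("Ixodes", "Ornithodoros", ""),
  species = c("", "", "yunkeri"),
  stringsAsFactors = FALSE
)

# Genera found only in incomplete_epithet, ignoring empty or missing genus values.
missing_genus_names <- setdiff(incomplete_epithet$genus, single_epithet$genus)
missing_genus_names <- missing_genus_names[
  !is.na(missing_genus_names) & missing_genus_names != ""
]

# One placeholder row per missing genus, so each genus keeps at least one child.
missing_genera <- data.frame(
  genus   = missing_genus_names,
  species = rep("sp", length(missing_genus_names)),
  stringsAsFactors = FALSE
)

single_epithet_with_placeholders <- rbind(single_epithet, missing_genera)
```

`setdiff()` already returns unique values, so no separate `unique()` call is needed here.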

(10.) fix capitalization for both genus and species all terms?

Yeah, I think we can expand this.
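Expanding this to more fields could look roughly like the sketch below. The helper `capitalize_first` and the column set are assumptions: genus gets a leading capital and the epithet fields are lower-cased.

```r
# Capitalize the first letter and lower-case the rest; NA stays NA.
capitalize_first <- function(x) {
  out <- paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
  out[is.na(x)] <- NA_character_
  out
}

taxa <- data.frame(
  genus                = c("IXODES", "amblyomma"),
  specificEpithet      = c("Ricinus", "TRIGUTTATUM"),
  infraspecificEpithet = c("", "Triguttatum"),
  stringsAsFactors = FALSE
)

taxa$genus <- capitalize_first(taxa$genus)

# Epithets are written entirely in lower case.
epithet_fields <- c("specificEpithet", "infraspecificEpithet")
taxa[epithet_fields] <- lapply(taxa[epithet_fields], tolower)
```

Higher ranks such as family could reuse `capitalize_first` by adding their column names to the call.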

(19.) remove sp's **also ssp's?**

Easy enough. Any other possibilities?
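As a starting point, the placeholder check could be a single anchored pattern. The list of abbreviations below is an assumption about what might turn up in the data, not something taken from the script.

```r
# Assumed list of placeholder / open-nomenclature abbreviations; extend as needed.
placeholder_pattern <- "^(sp|spp|ssp|sspp|subsp|cf|aff|nr)\\.?$"

# TRUE when the whole (trimmed) value is a placeholder, with or without a final period.
is_placeholder <- function(x) {
  grepl(placeholder_pattern, trimws(x), ignore.case = TRUE)
}

epithets <- c("sp", "sp.", "ssp.", "spinosa", "cf.", "ricinus", "SPP")
flags <- is_placeholder(epithets)
```

The `^` and `$` anchors matter: without them a real epithet such as "spinosa" would be flagged because it starts with "sp".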

(25.) remove any empty strings as genera I think this means that genus is NULL but there is a term in either specificEpithet or infraspecificEpithet?

Can mean genus is '' (empty string != NULL) and that either or both of those other fields are populated or also ''.
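In R a data frame cell cannot hold NULL, so the practical distinction is between `NA` and `''`. A small sketch that treats both as blank is below; the helper `is_blank` and the example rows are hypothetical.

```r
# Treat NA, "" and whitespace-only strings as blank.
is_blank <- function(x) is.na(x) | trimws(x) == ""

taxa <- data.frame(
  genus           = c("Ixodes", "", NA, "  "),
  specificEpithet = c("ricinus", "scapularis", "", "pacificus"),
  stringsAsFactors = FALSE
)

# Blank genus but a populated epithet: candidates for manual review.
orphan_epithet <- is_blank(taxa$genus) & !is_blank(taxa$specificEpithet)
```

Rows where both fields are blank are not flagged by `orphan_epithet`; they could be dropped or reported separately.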

  1. generate scientificNameAuthorship - note, that for animals this should include author and year...

We can handle that. Authors and years are still a WIP - see line 231. Can you link to the source for "note, that for animals this should include author and year..."? I wasn't aware of that and didn't see it on the DarwinCore page. The examples don't all include a year, but maybe those are plant examples. Weird to give animals and plants a different set of formatting rules for the same DarwinCore term!

I'm new to collaborative github'ing, but I think if you want to rearrange the order of the code you would submit a pull request to start a new branch, change the code on that branch, test it, and if it works equally well or better, we could merge those changes back into the main branch. Maybe we should consider some test-driven development to evaluate this. This is something I am doing in Python to ensure edits produce results as expected, without manually checking inputs and outputs every time the code changes. But I am not sure how to do that in R. Maybe Vijay is familiar?
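For what it's worth, a very small version of this can be done in base R with `stopifnot()`. The testthat package offers a fuller framework (`test_that()`, `expect_equal()`), but the sketch below avoids extra dependencies. The function `trim_and_squish` is a made-up stand-in for one of the cleaning steps.

```r
# Hypothetical cleaning function under test: trim ends, collapse internal whitespace.
trim_and_squish <- function(x) gsub("\\s+", " ", trimws(x))

# Each check pairs an input with the output we expect;
# stopifnot() raises an error at the first check that fails.
run_cleaning_tests <- function() {
  stopifnot(
    identical(trim_and_squish("  Ixodes   ricinus "), "Ixodes ricinus"),
    identical(trim_and_squish("Amblyomma"), "Amblyomma"),
    identical(trim_and_squish(""), "")
  )
  invisible(TRUE)
}

run_cleaning_tests()
```

Re-running `run_cleaning_tests()` after reordering the cleaning steps would show quickly whether the same inputs still give the same outputs.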

Jegelewicz commented 3 years ago

Weird to give animals and plants a different set of formatting rules for the same DarwinCore term!

The format is set based upon the nomenclatural code and plant and animal people definitely do things differently...ICZN vs ICBN. Of course, taxonomists appear loath to follow codes, so there are also "conventions" in almost every discipline. UGH

njdowdy commented 3 years ago

Right, but I couldn't find anywhere in ICZN that specified how to construct a "scientificNameAuthorship"

Jegelewicz commented 3 years ago

Why does that not surprise me? But I know there are some kind of rules - like adding parentheses when the name gets changed due to nomenclatural rules. Right?

Jegelewicz commented 3 years ago

I couldn't find anywhere in ICZN that specified how to construct a "scientificNameAuthorship"

https://www.iczn.org/the-code/the-international-code-of-zoological-nomenclature/the-code-online/

Enjoy

njdowdy commented 3 years ago

I looked through that yesterday briefly and didn't see anything. I looked again today and I still don't see any specific rules about how database-related terms should be formed. This doesn't surprise me, since the ICZN usually doesn't produce comprehensive rules, only the minimum needed to keep taxonomy functioning (see the recent controversy over what is allowed to constitute a type specimen). So, I'm still left wondering how Darwin Core formed the rule that ICZN names are Author, Year and ICBN are only Author.

Jegelewicz commented 3 years ago

how database-related terms should be formed.

It does not matter if it is in a database - the terms should be formed as indicated in the code for any purpose.

51.1. Optional use of names of authors

The name of the author does not form part of the name of a taxon and its citation is optional, although customary and often advisable.

Recommendation 51A. Citation of author and date. The original author and date of a name should be cited at least once in each work dealing with the taxon denoted by that name. This is especially important in distinguishing between homonyms and in identifying species-group names which are not in their original combinations. If the surname and forename(s) of an author are liable to be confused, these should be distinguished as in scientific bibliographies.

Recommendation 51B. Transliteration of author's name. When the author's name is customarily written in a language that does not use the Latin alphabet it should be given in Latin letters with or without diacritic marks.

51.2. Form of citation of authorship

The name of an author follows the name of the taxon without any intervening mark of punctuation, except in changed combinations as provided in Article 51.3.

Recommendation 51C. Citation of multiple authors. When three or more joint authors have been responsible for a name, then the citation of the name of the authors may be expressed by use of the term "et al." following the name of the first author, provided that all authors of the name are cited in full elsewhere in the same work, either in the text or in a bibliographic reference.

51.2.1. The name of a subsequent user, if cited, is to be separated from the name of the taxon in some distinctive and explicit manner, but not by parentheses (cf. Article 51.3), unless an explanation is included.

Example. Reference to Cancer pagurus Linnaeus as used by Latreille may be cited as "Cancer pagurus Linnaeus sensu Latreille", or as "Cancer pagurus Linnaeus (as interpreted by Latreille)" or in some other distinctive manner, but not as "Cancer pagurus Latreille" or "Cancer pagurus (Latreille)".

Recommendation 51D. Author anonymous, or anonymous but known or inferred. If the name of a taxon was (or is deemed to have been) established anonymously, the term "Anon." may be used as though it was the name of the authors. However, if the authorship is known or inferred from external evidence, the name of the author, if cited, should be enclosed in square brackets to show the original anonymity. For availability of names proposed anonymously see Article 14.

Recommendation 51E. Citation of contributors. If a scientific name and the conditions other than publication that make it available [Arts. 10 to 20] are the responsibility not of the author of the work containing them, but of some other person(s), or of less than all of joint authors, the authorship of the name, if cited, should be stated as "B in A", or "B in A & B", or in whatever form is appropriate to facilitate information retrieval (normally the date should also be cited).

Recommendation 51F. Citation of author of unavailable or excluded names. If citation of authorship for an unavailable or excluded name [Rec. 50C] is necessary or desirable, the nomenclatural status of the name should be made evident.

Examples. Halmaturus rutilis Lichtenstein, 1818 (nomen nudum); Yerboa gigantea Zimmermann, 1777 (published in a work rejected by the Commission in Opinion 257); "Pseudosquille" (a vernacular name published by Eydoux & Souleyet (1842)).

51.3. Use of parentheses around authors' names (and dates) in changed combinations

When a species-group name is combined with a generic name other than the original one, the name of the author of the species-group name, if cited, is to be enclosed in parentheses (the date, if cited, is to be enclosed within the same parentheses).

Example. Taenia diminuta Rudolphi, when transferred to the genus Hymenolepis, is cited as Hymenolepis diminuta (Rudolphi) or Hymenolepis diminuta (Rudolphi, 1819).

51.3.1. Parentheses are not used when the species-group name was originally combined with an incorrect spelling or an emendation of the generic name (this applies even though an unjustified emendation is an available name with its own authorship and date [Art. 33.2.3]).

Example. The species-group name subantiqua d'Orbigny, 1850 was established in combination with Fenestrella, d'Orbigny's incorrect spelling of Fenestella Lonsdale, 1839. The species is cited as Fenestella subantiqua d'Orbigny, 1850, and not as Fenestella subantiqua (d'Orbigny, 1850).

51.3.2. The use of parentheses enclosing the name of the author and the date is not affected by the presence of a subgeneric name, by transfer to a different subgenus within the same genus, by a change of rank within the species group, or by transfer of a subspecies to a different species within the same genus.

Example. Goniocidaris florigena Agassiz, when transferred to the genus Petalocidaris, is cited as Petalocidaris florigena (Agassiz). When Petalocidaris is treated as a subgenus of Goniocidaris the parentheses are omitted, even when the complete citation is given as Goniocidaris (Petalocidaris) florigena Agassiz.

51.3.3. If before 1961 a new species-group name was established in combination with a previously available genus-group name and, at the same time, the author conditionally proposed a new nominal genus for it, parentheses are not used with the author's name when the species-group name is used in combination with the previously established generic name, but are used when the species-group name is combined with the conditionally proposed generic name (see Article 11.9.3.6).

Example. Lowe (1843) established the new fish species Seriola gracilis and at the same time conditionally proposed a new genus Cubiceps to contain that nominal species. When included in Cubiceps, the name is cited as Cubiceps gracilis (Lowe, 1843).

Recommendation 51G. Citation of person making new combination. If it is desired to cite both the author of a species-group nominal taxon and the person who first transferred it to another genus, the name of the person forming the new combination should follow the parentheses that enclose the name of the author of the species-group name (and the date, if cited; see Recommendation 22A.3).

Examples. Limnatis nilotica (Savigny) Moquin-Tandon; Methiolopsis geniculata (Stål, 1878) Rehn, 1957.

Jegelewicz commented 3 years ago

I think this is the pertinent remark:

Recommendation 51A. Citation of author and date. The original author and date of a name should be cited at least once in each work dealing with the taxon denoted by that name.

Including the name in a database is including it in a "work dealing with the taxon denoted by that name", so it should be cited appropriately at least once - we would do that via scientificNameAuthorship following all of the rules above. (Assuming we can figure them out...)
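As a rough sketch of how Articles 51.2 and 51.3 quoted above might translate into code, the function below builds the authorship string and adds parentheses when the genus has changed. The function name and its inputs are hypothetical, and the script may not track the original genus at all. It also does not handle the special cases in 51.3.1 to 51.3.3; for a misspelled original genus, the caller would have to pass the corrected spelling.

```r
# Build an authorship string for an animal name.
# original_genus / current_genus are assumed inputs, not fields from the script.
format_authorship <- function(author, year, original_genus, current_genus) {
  # Author alone, or "Author, Year" when a year is available.
  citation <- ifelse(is.na(year), author, paste0(author, ", ", year))
  # Art. 51.3: parentheses when the species is now combined with a different genus.
  changed <- original_genus != current_genus
  ifelse(changed, paste0("(", citation, ")"), citation)
}

format_authorship("Rudolphi", 1819, "Taenia", "Hymenolepis")      # "(Rudolphi, 1819)"
format_authorship("Rudolphi", NA, "Taenia", "Hymenolepis")        # "(Rudolphi)"
format_authorship("d'Orbigny", 1850, "Fenestella", "Fenestella")  # "d'Orbigny, 1850"
```

The first two calls reproduce the Hymenolepis diminuta example from Article 51.3.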

njdowdy commented 3 years ago

Well, then perhaps my issue is with ICBN. Are you telling me the ICBN doesn't rule that years should be given on citations of taxonomic names??

Jegelewicz commented 3 years ago

Nope - I have no idea what ICBN says. Note that even ICZN uses author only in its examples even though it explicitly states "original author and date", which implies a date, not a year....

Jegelewicz commented 3 years ago

Here is ICBN - https://www.iapt-taxon.org/icbn/main.htm

Jegelewicz commented 3 years ago

See article 46

46.1. In publications, particularly those dealing with taxonomy and nomenclature, it may be desirable, even when no bibliographic reference to the protologue is made, to cite the author(s) of the name concerned (see Art. 6 Note 2; see also Art. 22.1 and 26.1). In so doing, the following rules are to be followed. Ex. 1. Rosaceae Juss., Rosa L., Rosa gallica L., Rosa gallica var. eriostyla R. Keller, Rosa gallica L. var. gallica.

Jegelewicz commented 3 years ago

I don't know who writes this stuff, but it is as clear as mud.

njdowdy commented 3 years ago

I hate that so much. Also, 51A is just 'recommendation', so not actually a rule... gah! whyyyy

Jegelewicz commented 3 years ago

Because - taxonomy....

Jegelewicz commented 3 years ago

Taxonomy really needs to step into the 20th century, yes, I said 20th....

Jegelewicz commented 3 years ago

And we need rules dammit! All of these "recommendations" just make for bad science.

Jegelewicz commented 3 years ago

@njdowdy I have done a lot of stuff to my branch of the ixodes code, but I am not finished. Hopefully by the time we meet up on Monday afternoon, I'll be done and we can discuss.

njdowdy commented 3 years ago

Great! Looking forward to it.