Modify structure verification_file.csv

damianooldoni commented 5 years ago

File verification_file.csv will be used by taxonomists for verifying taxa. At the moment, it contains the following columns in the following order:

scientificName
bb_scientificName
bb_taxonomicStatus
bb_acceptedName
bb_key
bb_acceptedKey
bb_kingdom
issues
verification_key
date_added
checklists
remarks

This structure is optimal for experts but too difficult to manage. The main source of bugs is the combination of the following properties:

Mapping by names.
Multiple checklists (comma separated) allowed for the same scientificName.

Below are the columns of taxa.csv. I checked the boxes next to the columns I think we need to include in verification_file.csv:

[x] taxonKey
[x] scientificName
[ ] taxonID
[x] datasetKey
[ ] nameType
[ ] issues
[ ] validDistribution
[x] bb_key
[x] bb_scientificName
[ ] bb_species
[ ] bb_genus
[ ] bb_family
[ ] bb_order
[ ] bb_class
[ ] bb_phylum
[x] bb_kingdom
[x] bb_rank
[ ] bb_speciesKey
[x] bb_taxonomicStatus
[x] bb_acceptedKey
[x] bb_acceptedName

In addition to these columns, the following ones are specific to verification_file.csv and should be present:

verificationKey
dateAdded
remarks

Based on what we decide in this issue I will modify (= simplify) trias::verify_taxa(). @peterdesmet What do you think?

peterdesmet commented 5 years ago

List of columns (in order) of verification file:

taxonKey: T
scientificName
datasetKey
bb_key: B
bb_scientificName
bb_kingdom
bb_rank
bb_taxonomicStatus
bb_acceptedKey: A
bb_acceptedName
bb_acceptedKingdom: expected to be same as bb_kingdom
bb_acceptedRank
bb_acceptedTaxonomicStatus: expected to always be ACCEPTED
verificationKey
remarks
dateAdded

Note:

The 5 columns 4 to 8 have the same structure as columns 9 to 13, making them easier to compare
the existence of a line should be decided on the combination of T, B and A
extension should be tsv, so it can be easily copy pasted
file should be written to interim/verification_file.tsv
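
A minimal sketch of the layout listed above, in base R. Only the column names, the T/B/A triplet and the tsv extension come from this comment; the helper name write_verification, the sample row and its key values are made up for illustration, and the file is written to a temporary directory rather than interim/.

```r
# Column order as listed in this comment
verification_cols <- c(
  "taxonKey", "scientificName", "datasetKey",
  "bb_key", "bb_scientificName", "bb_kingdom", "bb_rank",
  "bb_taxonomicStatus",
  "bb_acceptedKey", "bb_acceptedName", "bb_acceptedKingdom",
  "bb_acceptedRank", "bb_acceptedTaxonomicStatus",
  "verificationKey", "remarks", "dateAdded"
)

# Triplet T, B, A that decides whether a line exists
triplet <- c("taxonKey", "bb_key", "bb_acceptedKey")

# Hypothetical helper: enforce column order and triplet uniqueness, write TSV
write_verification <- function(df, path) {
  stopifnot(all(verification_cols %in% names(df)))
  df <- df[, verification_cols, drop = FALSE]
  stopifnot(anyDuplicated(df[, triplet]) == 0)
  write.table(df, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, na = "")
  invisible(df)
}

# Made-up single row with only the triplet filled in
example <- data.frame(
  matrix(NA_character_, nrow = 1, ncol = length(verification_cols))
)
names(example) <- verification_cols
example$taxonKey <- 1L
example$bb_key <- 10L
example$bb_acceptedKey <- 11L

path <- file.path(tempdir(), "verification_file.tsv")
write_verification(example, path)
read_back <- read.delim(path, stringsAsFactors = FALSE)
```

Checking uniqueness on the triplet before writing is one way to make sure two lines never describe the same T, B, A combination.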

damianooldoni commented 5 years ago

While working on this, I found that it would be very practical to have a boolean column to indicate whether the synonym relation is outdated or not. So far, we had agreed to just add "Outdated synonym." to the remarks column. Some advantages of adding such a column:

Taxonomists can easily filter the outdated synonyms out.
Code of function verify_taxa() is more readable.
Chance of bugs in function verify_taxa() decreases.
Function verify_taxa() works faster.

Suggested column name: outdated. @peterdesmet: what do you think?
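
To illustrate the filtering advantage, here is a small base R sketch comparing the two approaches. The rows and remark texts are made up, and the column is assumed to be called outdated.

```r
# Made-up rows that record an outdated synonym relation in both ways
verification <- data.frame(
  taxonKey = c(1L, 2L, 3L),
  remarks  = c("Outdated synonym.", "check author",
               "Outdated synonym. see mail"),
  outdated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Text-based: depends on the exact wording inside a free-text column
by_remark <- verification[
  grepl("Outdated synonym.", verification$remarks, fixed = TRUE),
]

# Column-based: a plain logical filter
by_flag <- verification[verification$outdated, ]

# Rows a taxonomist would keep after filtering the outdated ones out
current <- verification[!verification$outdated, ]
```

Both filters select the same rows here, but the text-based one breaks as soon as someone edits the wording in remarks.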

peterdesmet commented 5 years ago

Agreed, as last column.

damianooldoni commented 5 years ago

I would say second to last column. I would leave remarks as last one. @peterdesmet : Do you agree?

damianooldoni commented 5 years ago

I see that dateAdded is the last one in your description. Ok, then I put outdated as last one! :+1:

peterdesmet commented 5 years ago

Yeah, the remarks field is moved forward so that editors can easily add things there.

damianooldoni commented 5 years ago

Based on the triplet T, B and A (c("taxonKey", "bb_key", "bb_acceptedKey")), the outdated synonyms are detected while detecting unused taxa. That's nice actually. I would call this extra column used instead of outdated. Accepted values:

TRUE (in use)
FALSE (not in use)

peterdesmet commented 5 years ago

I understand your reasoning, but the active step is marking something as outdated (TRUE) on a rerun of the script. The other values (FALSE) are just default values: saying it is used can be misleading (e.g. if there is no verification key it will not be "used").

damianooldoni commented 5 years ago

I understand your reasoning and I agree with it. Up to now, one of the informative output data frames of verify_taxa() is unused_taxa: it includes all taxa in verification_file.csv which are not in the input taxa, independent of verification key. Following your reasoning, I would rename this df to outdated_taxa.
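
The triplet-based detection discussed above could be sketched in base R as follows. This is not the actual code of verify_taxa(); the data, the key values and the helper make_key are made up for illustration.

```r
triplet <- c("taxonKey", "bb_key", "bb_acceptedKey")

# Made-up rows of the existing verification file
verification <- data.frame(
  taxonKey        = c(1L, 2L, 3L),
  bb_key          = c(10L, 20L, 30L),
  bb_acceptedKey  = c(11L, 21L, 31L),
  verificationKey = c(NA, 21L, NA)
)

# Made-up input taxa: the accepted key of taxon 2 has changed
taxa <- data.frame(
  taxonKey       = c(1L, 2L, 3L),
  bb_key         = c(10L, 20L, 30L),
  bb_acceptedKey = c(11L, 22L, 31L)
)

# Hypothetical helper: one comparable string per row, built from the triplet
make_key <- function(df) do.call(paste, c(df[triplet], sep = "|"))

# A row is outdated when its triplet no longer occurs in the input taxa,
# independent of whether a verification key was filled in
verification$outdated <- !(make_key(verification) %in% make_key(taxa))

outdated_taxa <- verification[verification$outdated, ]
```

With FALSE as the default and TRUE set only on a rerun, the column says nothing about whether a row is "used", which matches the reasoning above.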

peterdesmet commented 5 years ago

@damianooldoni I think this can be closed?

trias-project / unified-checklist

Modify structure verification_file.csv #17