trias-project / unified-checklist

🇧🇪 Global Register of Introduced and Invasive Species - Belgium
https://trias-project.github.io/unified-checklist/
MIT License

Modify structure verification_file.csv #17

Closed damianooldoni closed 5 years ago

damianooldoni commented 5 years ago

File verification_file.csv will be used by taxonomists for verifying taxa. At the moment, it contains the following columns in the following order:

This structure is optimal for experts but too difficult to manage. The main source of bugs is the combination of the following two properties:

  1. Mapping by names.
  2. Multiple checklists (comma-separated) are allowed for the same scientificName.
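
To illustrate why this combination is fragile, here is a minimal sketch in Python (not the trias R code). The column name "checklists", the checklist names and the taxon names are all hypothetical, invented only to show the problem: with name-based mapping and comma-separated checklists, every lookup has to split a cell first.

```python
# Hypothetical rows: the "checklists" column and all values are illustrative only.
rows = [
    {"scientificName": "Aus bus", "checklists": "checklist_a,checklist_b"},
    {"scientificName": "Cus dus", "checklists": "checklist_b"},
]


def names_in_checklist(rows, checklist):
    """Return the names whose comma-separated checklist cell contains `checklist`."""
    return [
        row["scientificName"]
        for row in rows
        if checklist in [c.strip() for c in row["checklists"].split(",")]
    ]


names_for_a = names_in_checklist(rows, "checklist_a")
names_for_b = names_in_checklist(rows, "checklist_b")
```

A layout with one row per taxon and checklist would make the same lookup a plain equality test on a key column.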

Below are the columns of taxa.csv. I checked the box next to the columns I think we need to include in verification_file.csv:

In addition to these columns, the following ones are specific to verification_file.csv and should be present:

Based on what we decide in this issue I will modify (= simplify) trias::verify_taxa(). @peterdesmet What do you think?

peterdesmet commented 5 years ago

List of columns (in order) of verification file:

  1. taxonKey: T
  2. scientificName
  3. datasetKey
  4. bb_key: B
  5. bb_scientificName
  6. bb_kingdom
  7. bb_rank
  8. bb_taxonomicStatus
  9. bb_acceptedKey: A
  10. bb_acceptedName
  11. bb_acceptedKingdom: expected to be same as bb_kingdom
  12. bb_acceptedRank
  13. bb_acceptedTaxonomicStatus: expected to always be ACCEPTED
  14. verificationKey
  15. remarks
  16. dateAdded
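
As a rough illustration of how this column order and the two stated expectations could be checked, here is a minimal sketch in Python (not the trias R code). The helper names check_header and expectation_warnings are hypothetical, and skipping rows with an empty bb_acceptedKey is an assumption, not something specified in this issue.

```python
import csv
import io

# Column order as listed above.
VERIFICATION_COLUMNS = [
    "taxonKey", "scientificName", "datasetKey",
    "bb_key", "bb_scientificName", "bb_kingdom", "bb_rank", "bb_taxonomicStatus",
    "bb_acceptedKey", "bb_acceptedName", "bb_acceptedKingdom", "bb_acceptedRank",
    "bb_acceptedTaxonomicStatus",
    "verificationKey", "remarks", "dateAdded",
]


def check_header(csv_text):
    """Return True if the first CSV row matches the expected columns, in order."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return header == VERIFICATION_COLUMNS


def expectation_warnings(row):
    """Collect warnings for the two expectations noted in the list (hypothetical helper)."""
    warnings = []
    if not row["bb_acceptedKey"]:
        # Assumption: rows without an accepted taxon are not checked.
        return warnings
    if row["bb_acceptedKingdom"] != row["bb_kingdom"]:
        warnings.append("bb_acceptedKingdom differs from bb_kingdom")
    if row["bb_acceptedTaxonomicStatus"] != "ACCEPTED":
        warnings.append("bb_acceptedTaxonomicStatus is not ACCEPTED")
    return warnings
```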

Note:

damianooldoni commented 5 years ago

While working on this, I found that it would be very practical to have a boolean column indicating whether the synonym relation is outdated. Up to now, we agreed to just add "Outdated synonym." to the remarks column. Some advantages of adding such a column:

  1. Taxonomists can easily filter the outdated synonyms out.
  2. Code of function verify_taxa() is more readable.
  3. Chance of bugs in function verify_taxa() decreases.
  4. Function verify_taxa() works faster.

Suggested column name: outdated. @peterdesmet: what do you think?
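
The difference between the two conventions can be sketched as follows, in Python rather than R and with hypothetical sample rows. It only illustrates the filtering and readability points, not the actual verify_taxa() code.

```python
# Hypothetical rows: keys and remarks text are illustrative only.
rows = [
    {"taxonKey": "1", "remarks": "Outdated synonym. Checked by expert.", "outdated": True},
    {"taxonKey": "2", "remarks": "Needs a second opinion.", "outdated": False},
]

# Convention agreed so far: search for a fixed sentence inside free text.
by_remarks = [r["taxonKey"] for r in rows if "Outdated synonym." in r["remarks"]]

# With a dedicated boolean column, the filter does not depend on free text,
# so an editor rewording the remarks cannot break it.
by_flag = [r["taxonKey"] for r in rows if r["outdated"]]
```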

peterdesmet commented 5 years ago

Agreed, as last column.

damianooldoni commented 5 years ago

I would say second to last column. I would leave remarks as last one. @peterdesmet : Do you agree?

damianooldoni commented 5 years ago

I see that dateAdded is the last one in your description. Ok, then I put outdated as last one! :+1:

peterdesmet commented 5 years ago

Yeah, the remarks field is moved forward so that editors can easily add things there.

damianooldoni commented 5 years ago

Based on the triplet T, B and A (c("taxonKey", "bb_key", "bb_acceptedKey")), the outdated synonyms are detected while detecting unused taxa. That's nice actually. I would call this extra column used instead of outdated. Accepted values:

  1. TRUE (in use),
  2. FALSE (not in use).

peterdesmet commented 5 years ago

I understand your reasoning, but the active step is marking something as outdated (TRUE) when the script is rerun. The other values (FALSE) are just default values: saying it is used can be misleading (e.g. if there is no verification key, it will not be "used").

damianooldoni commented 5 years ago

I understand your reasoning and I agree with it. Up to now, one of the informative output data frames of verify_taxa() is unused_taxa: it includes all taxa in verification_file.csv that are not in the input taxa, regardless of verification key. Following your reasoning, I would rename this data frame to outdated_taxa.
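
A minimal sketch of this detection, in Python rather than R and not the actual verify_taxa() implementation: a row of the verification file is marked as outdated when its (taxonKey, bb_key, bb_acceptedKey) triplet no longer occurs in the input taxa, whatever its verificationKey. The function name mark_outdated and the sample data in the usage note are hypothetical.

```python
KEY_COLUMNS = ("taxonKey", "bb_key", "bb_acceptedKey")


def triplet(row):
    """Return the (T, B, A) key triplet of a row."""
    return tuple(row[column] for column in KEY_COLUMNS)


def mark_outdated(verification_rows, input_taxa):
    """Set row["outdated"] on every verification row and return the outdated rows.

    A row is outdated when its triplet is absent from the input taxa.
    The verificationKey is deliberately not consulted.
    """
    current = {triplet(taxon) for taxon in input_taxa}
    outdated_taxa = []
    for row in verification_rows:
        row["outdated"] = triplet(row) not in current
        if row["outdated"]:
            outdated_taxa.append(row)
    return outdated_taxa
```

For example, if the verification file holds triplets ("1", "10", "100") and ("2", "20", "200") while the input taxa now hold ("1", "10", "100") and ("2", "20", "201"), only the second row is returned, and FALSE remains the default for the first.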

peterdesmet commented 5 years ago

@damianooldoni I think this can be closed?