plazi / community

This repo is intended to serve as a help desk for TreatmentBank-users.

Only include the classification as given in the treatments #230

Open mdoering opened 1 year ago

mdoering commented 1 year ago

AFAIK the DwC archives of Plazi use Catalogue of Life and GBIF to provide a standard classification for the species in a treatment. As this data is in turn used to build both of those, we end up with a self-assuring loop.

It would be much better to include only the taxonomic hierarchy as found in the article. That might be difficult to extract in some cases and nonexistent in others, but it would represent what was published far more accurately.

myrmoteras commented 1 year ago

@mdoering you are right, we import the hierarchy to place the taxa in a classification so that we can find them. This clearly has disadvantages, but it is also clearly defined and helpful for our purpose. I do not regard this as self-assuring.

We are in discussions with EJT and the MNHN team about extracting the hierarchy from the text as presented, and are slowly pursuing this. This would also allow alternative taxonomies to be imported into ChecklistBank/COL. But, as you pointed out, since this is not a standard in publications, it is not sufficient for our work.

Eventually, ChecklistBank curation will have to decide what the higher taxonomic ranks are for each taxon according to the hierarchy they use.

Maybe we could discuss this next week before the GBIF roundtable, assuming you will be in Copenhagen?

We need to see whether we could export only the taxonomic hierarchies that are extracted from the publication, and how we could internally attribute two hierarchies to a name?

mdoering commented 1 year ago

I don't dispute that it is useful for your purposes to decorate the treatments with a standard classification for Plazi to use for searches, metrics, browsing or even linking.

I am only arguing that this should not be included in the DwC archives that you serve to others. All users I have spoken with were quite surprised; they had considered the information to be a pure digitisation of a treatment, not an interpreted view.

If you need the classification to be present in the DwC archives yourself, maybe there should be two versions of it? COL and GBIF digest the treatments with a classification that comes from themselves. This is a dangerous loop.

mdoering commented 1 year ago

Including some other classification also leads to errors like this Lepidoptera genus being placed in plants: https://treatment.plazi.org/id/03EC879FFFB1FFCEA875AF4BFDF21428

mdoering commented 1 week ago

@gsautter where do we stand with the alternative dwca (or ColDP?) having only the original classification?

Some other issues with wrongly interpreted classifications:

gsautter commented 1 week ago

@mdoering I have not forgotten the ColDP task ... the plan is to add the scoped higher taxa from article titles and intermediate headings to the treatments, and then to export only those to the ColDP (as opposed to the way the DwCA is populated). There has also been a "headings" field group in our article statistics since late August to support the effort, mainly to assess which articles to tackle and which labels to expect ahead of the higher taxa, i.e. what to aim for with a tagger.
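To make the export rule above concrete, here is a minimal sketch of "export only the higher taxa that came from the article". It is not Plazi's code: the `HigherTaxon` record, the `source` labels, and the sample treatment are all hypothetical, and the real treatment markup may record provenance quite differently.

```python
from dataclasses import dataclass

# Hypothetical provenance labels (assumptions, not Plazi's actual vocabulary).
FROM_ARTICLE = "article"    # higher taxon found in an article title or heading
FROM_BACKBONE = "backbone"  # higher taxon filled in from COL/GBIF


@dataclass(frozen=True)
class HigherTaxon:
    rank: str
    name: str
    source: str


def original_classification(taxa):
    """Return rank -> name for higher taxa extracted from the publication only."""
    return {t.rank: t.name for t in taxa if t.source == FROM_ARTICLE}


# Made-up example treatment: one rank imported, two taken from the article.
treatment = [
    HigherTaxon("kingdom", "Animalia", FROM_BACKBONE),
    HigherTaxon("order", "Lepidoptera", FROM_ARTICLE),
    HigherTaxon("family", "Noctuidae", FROM_ARTICLE),
]

print(original_classification(treatment))
```

Under these assumptions, an export that wants the interpreted view would use the full list, while the ColDP export would use only what `original_classification` returns; a treatment with no article-derived ranks would simply export an empty classification.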

The corresponding mass update is supposed to start in 2-3 weeks, so that the treatment collection dumps, which have to include all the modified treatments (I'm estimating maybe half of the whole collection), can be scooped up by the December monthly dump and then absorbed into the annual full dump on January 1st, and thus no longer concern the monthly delta. This will still leave us December for any troubleshooting.