plazi / community

This repo is intended to serve as a help desk for TreatmentBank-users.

on 2022-12-14, Kendra claimed that [Barbastello leucomelas] should be replaced with [Barbastella leucomelas] in [cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597] #189

jhpoelen closed this issue 1 year ago

jhpoelen commented 1 year ago

on 2022-12-14, Kendra claimed that [Barbastello leucomelas] should be replaced with [Barbastella leucomelas] in [cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597]

myrmoteras commented 1 year ago

@jhpoelen can you please explain what this long string means and where the error is?

myrmoteras commented 1 year ago

fixed

myrmoteras commented 1 year ago

@jhpoelen can you please explain what the difference between issues 189-192 is?

jhpoelen commented 1 year ago

> @jhpoelen can you please explain what this long string means and where the error is?

In cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597:

cut: prefix indicating that the last part (e.g., b576-597) contains a specific byte location or range, written in the same notation as the POSIX/Linux cut command.

zip: prefix indicating that a zip entry is addressed within the content. In this case, that entry is: treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml

hash: is followed by the hash type (e.g., sha256) and the hash value (e.g., 56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197). The hash, the long hexadecimal sequence, is the unique digital fingerprint of the version of treatments-xml that was retrieved from GitHub.

In other words, the long string points to the exact location of the text segment that was the subject of this annotation (e.g., Barbastello leucomelas).
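
For readers who want to see the parts side by side, the decomposition can be sketched in plain shell. This is only an illustration using POSIX parameter expansion; the variable names (ref, sep, content_id, zip_entry, byte_range) are made up here and are not part of preston.

```shell
# Illustrative only: split a reference id into its three parts.
ref='cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597'
sep='!/'                                # separator between the parts

byte_range="${ref##*"$sep"}"            # text after the last separator
rest="${ref%"$sep"*}"                   # everything before the last separator
zip_entry="${rest#*"$sep"}"             # text after the first separator
content_id="${rest%%"$sep"*}"           # text before the first separator
content_id="${content_id#cut:zip:}"     # strip the cut: and zip: prefixes

printf 'content id: %s\n' "$content_id"
printf 'zip entry:  %s\n' "$zip_entry"
printf 'byte range: %s\n' "$byte_range"
```

Keeping the separator in a variable avoids typing a bare exclamation mark, which an interactive bash session might otherwise treat as history expansion.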

myrmoteras commented 1 year ago

sorry, so far we do not deal with code in the QC department. Can you please explain in a human readable format which name is wrong and in which treatment?

jhpoelen commented 1 year ago

In this case, the suspicious name is "Barbastello leucomelas",

and the related treatment is C30587A9A562FF93FF2F350F875ED55F.

As far as I can tell, all these information elements are currently present in the title and body of the issues.

jhpoelen commented 1 year ago

> @jhpoelen can you please explain what the difference between issues 189-192 is?

The difference between the annotations associated with

https://github.com/plazi/community/issues/189 https://github.com/plazi/community/issues/190 https://github.com/plazi/community/issues/191 https://github.com/plazi/community/issues/192

is the location of the subject in the treatment. In the table below, you'll find that the suffixes of the reference ids differ. They point to different text segments of a specific version of the treatment with UUID C30587A9A562FF93FF2F350F875ED55F.

| subjectReferenceId | seeAlso |
| --- | --- |
| cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597 | https://github.com/plazi/community/issues/189 |
| cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3054-3075 | https://github.com/plazi/community/issues/190 |
| cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3550-3571 | https://github.com/plazi/community/issues/191 |
| cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3714-3735 | https://github.com/plazi/community/issues/192 |
jhpoelen commented 1 year ago

So, it appears that the same subject (e.g., "Barbastello leucomelas") occurs four times in the treatment.
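
As a quick sanity check (a sketch, not part of the annotation tooling): each of the four byte ranges is inclusive and spans 22 bytes, which matches the length of the 22-character name. The helper range_len is hypothetical and only does the arithmetic.

```shell
# range_len is a hypothetical helper: print the number of bytes covered by
# an inclusive range suffix such as b576-597.
range_len() {
  r="${1#b}"                           # drop the leading "b"
  echo $(( ${r#*-} - ${r%-*} + 1 ))    # to - from + 1
}

name='Barbastello leucomelas'
echo "name: ${#name} characters"
for suffix in b576-597 b3054-3075 b3550-3571 b3714-3735; do
  echo "$suffix: $(range_len "$suffix") bytes"
done
```

Equal lengths do not prove that every range holds the same text, but they are consistent with four occurrences of the same name.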

jhpoelen commented 1 year ago

> fixed

Thanks for addressing the issue.

Can you please point to the location / version of the treatment that includes the fix?

myrmoteras commented 1 year ago

The latest version in the treatment-xml repo, in TB, gbif, etc.

myrmoteras commented 1 year ago

> So, it appears that the same subject (e.g., "Barbastello leucomelas") occurs four times in the treatment.

Can you please show me this in the treatment?

jhpoelen commented 1 year ago

> The latest version in the treatment-xml repo, in TB, gbif, etc.

Ok, can you please be more specific? I am assuming that latest versions can change in the future.

> So, it appears that the same subject (e.g., "Barbastello leucomelas") occurs four times in the treatment. Can you please show me this in the treatment?

Sure!

In both cases, the reference shows that the text segments originate from the treatment with uuid C30587A9A562FF93FF2F350F875ED55F as published on GitHub in treatments-xml with content id hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197. This version of your corpus can be found at:

Plazi Community. (2022). Plazi Treatments XML Archive hash://sha256/3cfd60b8b19e76d208377537835de92efdb5b945a6a71765b74ed2fe22298b42 hash://md5/594923284e3eb9965b8cbad149c76cd0f (hash://sha256/3cfd60b8b19e76d208377537835de92efdb5b945a6a71765b74ed2fe22298b42) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7443343

as well as a copy on my local hard disk.

you can make your own copy using instructions embedded in https://doi.org/10.5281/zenodo.7443343

preston clone https://zenodo.org/record/7443343/files

with that, you can retrieve the associated treatment (the specific version) via:

preston cat 'zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml'

see also attached C30587A9A562FF93FF2F350F875ED55F.xml.txt

then, you'll find the four occurrences of the text fragment using the suffix of the reference id

cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597
cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3054-3075
cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3550-3571
cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b3714-3735

or b576-597 b3054-3075 b3550-3571 b3714-3735

you'll find that all text segments contain instances of "Barbastello leucomelas".

But perhaps I should ask the question: what would you expect to use to highlight the annotated text segments? How do you annotate (and reference!) text segments in your annotated treatment texts?

jhpoelen commented 1 year ago

You can retrieve the text segments using:

preston cat 'cut:zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml!/b576-597'

yielding

Barbastello leucomelas

or

preston cat 'zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml' | cut -z -b576-597
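
To illustrate what the byte range does, here is a small sketch on a made-up string rather than the real treatment file. cut -b counts bytes starting at 1 and includes both ends of the range. The -z flag in the command above is a GNU cut option that uses NUL instead of newline as the line delimiter, so the range applies to the file as a whole rather than to each line separately.

```shell
# Hypothetical stand-in for the treatment XML: two filler bytes,
# the 22-byte name, then two more filler bytes.
sample='<<Barbastello leucomelas>>'

# Bytes 1-2 are the filler, so the name occupies bytes 3 through 24.
printf '%s' "$sample" | cut -b3-24
# prints: Barbastello leucomelas
```

The real range b576-597 works the same way, only counted from the start of the XML file.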

I can imagine that some fancy (web) UI can be hacked on top of this to make it visually a little more intuitive.

Thanks for being patient in resolving the annotations reported by @kephelps, @qgroom and you.

myrmoteras commented 1 year ago

Errors are fixed in TB.

For this, I need to know from you docId, docVersion, taxonomicName id

myrmoteras commented 1 year ago

or else, Kendra could ask: in this treatment https://tb.plazi.org/GgServer/html//C30587A9A562FF93FF2F350F875ED55F the name should be Barbastella

myrmoteras commented 1 year ago

thanks for the explanation and showing yet another way to look at and interoperate with our data liberated from publications.

jhpoelen commented 1 year ago

> For this, I need to know from you docId, docVersion, taxonomicName id

I believe that the requested information can be found in the header of the associated treatments xml file as retrieved via

preston cat 'zip:hash://sha256/56caf9620cd58df6fb517dfb7cd01e2e81e54a41eef2f562c5eddfbd70ba6197!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml'\
 | head -n1\
 | sed 's/"[ ]/"\n/g'

yielding:

<document ID-DOI="http://doi.org/10.5281/zenodo.7353060"
ID-ISBN="1-56098-217-9"
ID-Zenodo-Dep="7353060"
approvalRequired="4"
approvalRequired_for_document="2"
approvalRequired_for_matCits="1"
approvalRequired_for_originalDoi="1"
checkinTime="1667534087130"
checkinUser="GgServerImporter"
docAuthor="Karl F. Koopman"
docDate="1993"
docId="C30587A9A562FF93FF2F350F875ED55F"
docLanguage="en"
docName="MammalSpeciesofTheWorld.1993.Chiroptera.137-241.pdf.imd"
docOrigin="Mammal Species of the World (2 nd Edition), Washington and London: Smithsonian Institution Press"
docTitle="Barbastello leucomelas"
docType="treatment"
docVersion="3"
lastPageNumber="199"
masterDocId="3F3CFFD1A55CFFADFFDC34768324D72D"
masterDocTitle="Order Chiroptera"
masterLastPageNumber="241"
masterPageNumber="137"
pageNumber="199"
updateTime="1669256093699"
updateUser="ExternalLinkService">

Please confirm that this is the information you need to reference the content that contains the fix proposed by @kephelps .
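
If only docId and docVersion are needed, they can be pulled out of the header directly. The following is a rough sketch on a shortened, hypothetical copy of the header (three attributes taken from the output above); the helper attr is made up for this illustration, and a proper XML parser would be the more robust choice.

```shell
# Shortened, hypothetical copy of the <document ...> header shown above.
header='<document docId="C30587A9A562FF93FF2F350F875ED55F" docType="treatment" docVersion="3">'

# attr is a hypothetical helper: print the value of attribute $1
# found in the text read from stdin.
attr() {
  sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

printf '%s\n' "$header" | attr docId
printf '%s\n' "$header" | attr docVersion
```

With the real file, the header line would come from the preston cat command piped through head -n1 instead of the hard-coded string.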

jhpoelen commented 1 year ago

PS not quite sure what you mean by taxon id . Can you please provide an example?

jhpoelen commented 1 year ago

@myrmoteras @flsimoes after review, I found that there are still two occurrences of the suspicious name "Barbastello leucomelas".

A copy of plazi/treatments-xml was retrieved on 2022-12-14T15:15:28.064Z and contained two instances of the suspicious name, on lines 52 and 58:

via

preston cat 'zip:hash://sha256/c37add56f855607de5cbcd9d47d96346454e203559de66a15c30427e1fe172fb!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml'\
 | grep -n "Barbastello leucomelas"
52:<treatmentCitation author="Cretzschmar" authority="(Cretzschmar, 1826)" baseAuthorityName="Cretzschmar" baseAuthorityYear="1826" class="Mammalia" family="Vespertilionidae" genus="Barbastello" kingdom="Animalia" order="Chiroptera" page="73" pageId="62" pageNumber="199" phylum="Chordata" rank="species" species="leucomelas" title="Barbastello leucomelas" volumeTitle="In Ruppell, Atlas Reise Nordl. Afr., Zool. Saugeth." year="1826">
58:<bibCitation author="Cretzschmar" pageId="62" pageNumber="199" pagination="73" title="Barbastello leucomelas" volumeTitle="In Ruppell, Atlas Reise Nordl. Afr., Zool. Saugeth." year="1826">

So, I have evidence to suggest that the recommendation by @kephelps has only been partially applied.

Please confirm.

PS Please note that the recommended name did occur twice in the treatment of interest C30587A9A562FF93FF2F350F875ED55F , on line 1 and line 54

preston cat 'zip:hash://sha256/c37add56f855607de5cbcd9d47d96346454e203559de66a15c30427e1fe172fb!/treatments-xml-main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml'\
 | grep -n "Barbastella leucomelas"
1:<document ID-DOI="http://doi.org/10.5281/zenodo.7353060" ID-GBIF-Dataset="54c9783b-6624-4034-96d9-d09afe43b319" ID-ISBN="1-56098-217-9" ID-Zenodo-Dep="7353060" approvalRequired="4" approvalRequired_for_document="2" approvalRequired_for_matCits="1" approvalRequired_for_originalDoi="1" checkinTime="1667534087130" checkinUser="GgServerImporter" docAuthor="Karl F. Koopman" docDate="1993" docId="C30587A9A562FF93FF2F350F875ED55F" docLanguage="en" docName="MammalSpeciesofTheWorld.1993.Chiroptera.137-241.pdf.imd" docOrigin="Mammal Species of the World (2 nd Edition), Washington and London: Smithsonian Institution Press" docTitle="Barbastella leucomelas" docType="treatment" docVersion="6" lastPageNumber="199" masterDocId="3F3CFFD1A55CFFADFFDC34768324D72D" masterDocTitle="Order Chiroptera" masterLastPageNumber="241" masterPageNumber="137" pageNumber="199" updateTime="1673356398203" updateUser="valdenar">
54:<emphasis box="[243,606,377,416]" italics="true" pageId="62" pageNumber="199">Barbastella leucomelas</emphasis>
myrmoteras commented 1 year ago

can you please translate this into a format we can work with: please open the html version of the treatment and point us to the two wrong taxonomic names.

jhpoelen commented 1 year ago

@myrmoteras thanks for being patient with me as we are working our way through this tedious annotation adventure.

In reviewing https://tb.plazi.org/GgServer/html//C30587A9A562FF93FF2F350F875ED55F as accessed just now, it appears that the undesired occurrences of "Barbastello leucomelas" are not rendered in the html view of the treatment. However, in the treatments-xml version, they do appear on lines 52 and 58. So, it appears I cannot point you to the text segment in the html version of the treatment, because that text is not carried over into the html view.

Perhaps easier to point to a specific line in a github version at:

line 52 :

https://github.com/plazi/treatments-xml/blob/d1dc6a5b4370612f90418409ced19be7a1352a36/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml#L52

and line 58

https://github.com/plazi/treatments-xml/blob/d1dc6a5b4370612f90418409ced19be7a1352a36/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml#L58

My question remains -

What kind of specific reference will help us to point exactly to text segments that are subject of the annotation?

jhpoelen commented 1 year ago

I can confirm that this specific occurrence of Barbastello leucomelas was updated as suggested by @kephelps.

For more information see https://github.com/jhpoelen/msw-plazi/commit/905e30326b6123c09e867f8752dea3d065d63e37 .