plazi / treatmentBank

Repository devoted to house keeping of treatmentBank

A question about vernacular names #67

Open punkish opened 1 year ago

punkish commented 1 year ago

@gsautter

Consider the following (with box, pageId and pageNumber attributes removed for clarity)

<subSubSection id="C3336575552A56248876035F1784F3EC" … type="vernacular_names">
<paragraph id="8B9636FE552A56248876035F1784F3EC" …>
<heading id="D0DE8192552A56248876035F1784F3EC" …>
<vernacularName …>Pygmy Hog</vernacularName>
</heading>
</paragraph>
</subSubSection>
…
</paragraph>
</subSubSection>
<subSubSection id="C3336575552A56248BB8031C19B6F3C4" … type="vernacular_names">
<paragraph id="8B9636FE552A56248BB8031C19B6F3C4" …>
<heading id="D0DE8192552A56248BB8031C19B6F3C4" …>
<emphasis id="B95DEAEC552A56248BB8031C1777F3C4" …>French:</emphasis>
<vernacularName id="052A46D0552A56248862031C17EEF3C4" …>Sanglier nain</vernacularName>
/
<emphasis id="B95DEAEC552A562488D7031C167FF3C4" …>German:</emphasis>
<vernacularName id="052A46D0552A5624896B031C16B4F3C4" …>Zwergwildschwein</vernacularName>
/
<emphasis id="B95DEAEC552A5624899A031C1944F3C4" …>Spanish:</emphasis>
<vernacularName id="052A46D0552A5624863E031C1905F3C4" …>Jabali</vernacularName>
pigmeo
</heading>
</paragraph>
</subSubSection>

You have a tag called <subSubSection … type="vernacular_names"> that has one or more children tagged <vernacularName>. In one instance, <vernacularName> appears only once. In another instance, it appears many times, and is preceded by an <emphasis> tag with no semantic attributes, but content that is a language name followed by a ':'. My questions are as follows:

  1. Why are the languages not tagged with a semantic tag or attributes?
  2. How do you determine the vernacular name/language pair?
  3. Is it fair to assume that there is a many-to-many relationship (in a relational db sense) between treatments and language:vernacularName pair?

gsautter commented 1 year ago

  • Why are the languages not tagged with a semantic tag or attributes?

Simply because it hasn't been a requirement so far, and we don't have an application for it, either ... much more sensible, and actually already anticipated in the vernacular names table of the treatment stats, is a language field to go with each vernacular name.

  • How do you determine the vernacular name/language pair?

In this case, that was custom built by Connie for the Handbooks of the Mammals of the World ... there is no general-purpose tagger so far, simply because vernacular names are preciously rare in the scientific publications our pipeline is set up to process.
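A minimal sketch of how language/name pairs could be recovered from markup shaped like the example above. This is not the custom tagger mentioned here; it only assumes that a language label is an `<emphasis>` element whose text ends in ':' and that it applies to the next `<vernacularName>` sibling in the same `<heading>`. The function name `language_name_pairs` and the trimmed-down `SAMPLE` (ids and other attributes dropped) are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the second example above (ids and other attributes dropped).
SAMPLE = """<subSubSection type="vernacular_names">
<paragraph>
<heading>
<emphasis>French:</emphasis>
<vernacularName>Sanglier nain</vernacularName>
/
<emphasis>German:</emphasis>
<vernacularName>Zwergwildschwein</vernacularName>
/
<emphasis>Spanish:</emphasis>
<vernacularName>Jabali</vernacularName>
pigmeo
</heading>
</paragraph>
</subSubSection>"""


def language_name_pairs(xml_text):
    """Pair each vernacularName with the language label that precedes it.

    Assumption: a label is an <emphasis> whose text ends in ':' and it
    applies only to the next <vernacularName> in the same <heading>.
    Names without a preceding label get None as their language.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for heading in root.iter("heading"):
        language = None
        for child in heading:
            text = (child.text or "").strip()
            if child.tag == "emphasis" and text.endswith(":"):
                language = text[:-1].strip()
            elif child.tag == "vernacularName":
                pairs.append((language, text))
                language = None  # a label is consumed by one name
    return pairs


print(language_name_pairs(SAMPLE))
# [('French', 'Sanglier nain'), ('German', 'Zwergwildschwein'), ('Spanish', 'Jabali')]
```

Note that "pigmeo" sits outside the `<vernacularName>` tag in the example, so this sketch returns only "Jabali" for Spanish; picking up such trailing text would need an extra rule.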

  • Is it fair to assume that there is a many-to-many relationship (in a relational db sense) between treatments and language:vernacularName pair?

It's more between taxon names and vernacular names, but the many-to-many observation is pretty much correct ... most species (at least the ones large enough to be visible to a pair of human eyeballs) do have vernacular names in all the local languages of all the geographic regions they appear in ... quite possibly with some species or even genera conflated under one vernacular name in any given place as a result of similar looks.
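To make the many-to-many relationship concrete, here is a small sketch using an in-memory SQLite database. The table and column names (`taxon_names`, `vernacular_names`, `taxon_vernacular`) are hypothetical and not the actual treatment stats schema, and "Taxon A", "Taxon B" and the shared name "wild pig" are placeholders chosen only to show one vernacular name attached to two taxa.

```python
import sqlite3

SCHEMA = """
CREATE TABLE taxon_names (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE vernacular_names (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    language TEXT NOT NULL,
    UNIQUE (name, language)
);
-- Junction table: one row per (taxon name, vernacular name) link.
CREATE TABLE taxon_vernacular (
    taxon_id      INTEGER NOT NULL REFERENCES taxon_names(id),
    vernacular_id INTEGER NOT NULL REFERENCES vernacular_names(id),
    PRIMARY KEY (taxon_id, vernacular_id)
);
"""


def link(conn, taxon, vernacular, language):
    """Create the taxon and vernacular rows if missing, then link them."""
    conn.execute("INSERT OR IGNORE INTO taxon_names (name) VALUES (?)", (taxon,))
    conn.execute(
        "INSERT OR IGNORE INTO vernacular_names (name, language) VALUES (?, ?)",
        (vernacular, language),
    )
    taxon_id = conn.execute(
        "SELECT id FROM taxon_names WHERE name = ?", (taxon,)
    ).fetchone()[0]
    vernacular_id = conn.execute(
        "SELECT id FROM vernacular_names WHERE name = ? AND language = ?",
        (vernacular, language),
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO taxon_vernacular VALUES (?, ?)",
        (taxon_id, vernacular_id),
    )


def vernaculars_for(conn, taxon):
    """All (language, name) pairs linked to one taxon name."""
    return conn.execute(
        """SELECT v.language, v.name
           FROM vernacular_names v
           JOIN taxon_vernacular tv ON tv.vernacular_id = v.id
           JOIN taxon_names t ON t.id = tv.taxon_id
           WHERE t.name = ?
           ORDER BY v.language, v.name""",
        (taxon,),
    ).fetchall()


def taxa_for(conn, vernacular, language):
    """All taxon names linked to one vernacular name in one language."""
    rows = conn.execute(
        """SELECT t.name
           FROM taxon_names t
           JOIN taxon_vernacular tv ON tv.taxon_id = t.id
           JOIN vernacular_names v ON v.id = tv.vernacular_id
           WHERE v.name = ? AND v.language = ?
           ORDER BY t.name""",
        (vernacular, language),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
link(conn, "Taxon A", "Sanglier nain", "French")
link(conn, "Taxon A", "Zwergwildschwein", "German")
link(conn, "Taxon A", "wild pig", "English")  # placeholder shared name
link(conn, "Taxon B", "wild pig", "English")  # same name, different taxon

print(vernaculars_for(conn, "Taxon A"))
print(taxa_for(conn, "wild pig", "English"))
```

Keeping the language on the vernacular name row, rather than on the link, means the same spelling in two languages is stored as two distinct names; whether that is the right choice depends on how the names end up being used.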

punkish commented 1 year ago

In this case, that was custom built by Connie for the Handbooks of the Mammals of the World ... there is no general-purpose tagger so far, simply because vernacular names are preciously rare in the scientific publications our pipeline is set up to process.

Ah, I see. So the handbook is the only publication with vernacular names so far?

myrmoteras commented 1 year ago

There must be more, since we have had Isabelle's project that dealt with this. I have also annotated vernacular names from time to time.

gsautter commented 1 year ago

Yes, but the HBMW are the only ones in which we systematically tagged vernacular names, as far as I remember ... which documents did Isabelle work on?

myrmoteras commented 1 year ago

Is there a way to find this out by checking all the documents?

gsautter commented 1 year ago

Not an easy one: I remember that getting all the annotation types took several hours ... when we did that two months ago, the numbers were 30472 vernacularName annotations in a total of 173 articles ... since I don't think we have added many more since then, this should still be pretty close to the total count ... with only the HBMWs indexed, the SRS stats show 29885 vernacularName annotations in a total of 152 articles (book chapters) with a total of 5787 treatments, with some 600 more hidden in 21 articles that still need to be re-indexed ... quite the needle in a haystack, if a somewhat bigger needle this time.
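For anyone who wants to repeat such a count on their own copy of the data, a rough sketch follows. It is not the query behind the numbers above; it assumes the documents are already available as XML strings keyed by some identifier (the ids `doc-a` to `doc-c` and the tiny documents are made up for illustration) and simply counts `vernacularName` elements per document.

```python
import xml.etree.ElementTree as ET


def count_vernacular_names(documents):
    """Count vernacularName elements per document.

    `documents` maps a document id to its XML text (hypothetical input
    format). Returns ({doc_id: count}, total), listing only documents
    that contain at least one vernacularName.
    """
    per_document = {}
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        count = sum(1 for _ in root.iter("vernacularName"))
        if count:
            per_document[doc_id] = count
    return per_document, sum(per_document.values())


# Made-up miniature documents, for illustration only.
docs = {
    "doc-a": "<document><treatment>"
             "<vernacularName>Pygmy Hog</vernacularName>"
             "</treatment></document>",
    "doc-b": "<document><treatment>"
             "<paragraph>no common names here</paragraph>"
             "</treatment></document>",
    "doc-c": "<document><treatment>"
             "<vernacularName>Sanglier nain</vernacularName>"
             "<vernacularName>Zwergwildschwein</vernacularName>"
             "</treatment></document>",
}

per_document, total = count_vernacular_names(docs)
print(len(per_document), "documents,", total, "annotations")
# 2 documents, 3 annotations
```

On the full corpus this means parsing every document once, which fits the observation that such a scan takes hours.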