plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

use of Taxpub tags after import to SIBiLS #38

Open myrmoteras opened 2 years ago

myrmoteras commented 2 years ago

@tcatapano @jgobeill @pruch

Following issue https://github.com/plazi/eBioDiv/issues/28#issuecomment-1063968467 and the discussion at the tech meeting on March 10, the tags are used as follows in the EJT corpus https://github.com/plazi/ggxml2taxpub-treatments/tree/main/level1

image

the tags "563" are used for indexing and creating an article view (in fact a treatment centric view)

@jgobeill, @pruch please comment

patruch commented 2 years ago

Thank you for the notes @myrmoteras !

myrmoteras commented 2 years ago

@patruch @jgobeill

A thought regarding the removal of : the advantage of having a set of types is that it would allow more specific searches. For example, if we keep type=conservation, we could ask: which species have a certain conservation status? Which conservation statuses are available? The answer could then simply be what is in the tag.

Would that be helpful for the reuse of treatments in SIBiLS?

For me, the conservation case, for example, could be interesting for working with the Red List community at IUCN, the World Conservation Union.

The same applies to type=biology_ecology, where all the behavioral content of a treatment is located; keeping it might facilitate searching.
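To make the search idea concrete, here is a minimal sketch of pulling sections out of a document by their `sec-type` value. The sample XML is made up for illustration and is much simpler than a real TaxPub treatment; the function name `sections_by_type` is hypothetical. The lookup matches any element carrying a `sec-type` attribute, so it does not depend on a particular element name or namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample; real TaxPub treatments are richer.
SAMPLE = """<article>
  <body>
    <sec sec-type="conservation"><p>Vulnerable</p></sec>
    <sec sec-type="biology_ecology"><p>Nocturnal; nests in leaf litter.</p></sec>
    <sec sec-type="description"><p>Body length 5 mm.</p></sec>
  </body>
</article>"""


def sections_by_type(xml_text, wanted):
    """Return the whitespace-normalized text of every element whose
    sec-type attribute equals `wanted`."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        if element.get("sec-type") == wanted:
            # Join all nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            found.append(text)
    return found


print(sections_by_type(SAMPLE, "conservation"))
```

With the types kept in the export, a question such as "what conservation statuses are available?" reduces to collecting the text of all `conservation` sections across treatments.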

myrmoteras commented 2 years ago

@patruch @jgobeill

Here is another example, from the Handbook of the Mammals of the World:

https://tb.plazi.org/GgServer/taxPubL1/03C36F2EFFFB347EFF11441DF6EF0C5F

In this case, the book uses a set of additional types, such as "activity" or "breeding". These are essentially subtypes of "biology_ecology", and I wonder whether this might be something to consider, maybe in a next phase? Maybe we could create a vocabulary of terms that we then use in SIBiLS?

jgobeill commented 2 years ago

@myrmoteras @patruch This is document representation. For the next prototype, we will certainly try different fields and representations, even if some will be redundant. You will then be able to choose which representation is most useful.

patruch commented 2 years ago

Just a question: how do you pick these tags, like "breeding" or "activity"? Is it systematic or more ad hoc? For instance, "breeding" could be borrowed from the NCI Thesaurus's definition https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C42877.

myrmoteras commented 2 years ago

@patruch there are three answers:

  1. In most cases, we have a list of types that we use in our processing (see the table below).
  2. For the Handbook of the Mammals of the World, we additionally use, for the ca. 6,500 mammal species, the types that are used throughout the book. They could be grouped as children of type=biology_ecology.
  3. Pensoft uses a different approach. For the sec-type, they simply use whatever title the author gives the section, so there is a huge number of ad hoc types.

You can see the distribution of sec-types here:

| DocCount | SectType |
| -- | -- |
| 739159 | nomenclature |
| 307445 | description |
| 293299 | materials_examined |
| 264425 | reference_group |
| 258774 | distribution |
| 153970 | discussion |
| 133457 | diagnosis |
| 107163 | multiple |
| 101206 | etymology |
| 36734 | biology_ecology |
| 29415 | notes |
| 18268 | key |

nomenclature is not a sec-type, but tp:nomenclature

source: sec-types ranking.csv

This is for sure something we should discuss, probably before we make everything accessible. Most of the terms could be mapped to one of the widely used ones, and at the same time we could introduce some hierarchy among the terms.
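As a starting point for that discussion, a rough sketch of what such a mapping plus hierarchy could look like. Everything here is an assumption for illustration: the `PARENT` links cover only the two subtypes mentioned in this thread, the `SYNONYMS` entries are invented examples of ad hoc section titles, and the function names are hypothetical. It is not how the current processing works.

```python
# Hypothetical child -> parent links (only the subtypes named in this thread).
PARENT = {
    "activity": "biology_ecology",
    "breeding": "biology_ecology",
}

# Hypothetical mapping from ad hoc section titles to widely used types.
SYNONYMS = {
    "ecology": "biology_ecology",
    "material examined": "materials_examined",
}


def normalize(raw):
    """Map a raw section title or sec-type to a controlled term."""
    key = " ".join(raw.strip().lower().replace("_", " ").split())
    if key in SYNONYMS:
        return SYNONYMS[key]
    return key.replace(" ", "_")


def ancestors(term):
    """Return the term followed by its parents, nearest first."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(section_type, query):
    """True if a section of this type should be returned for `query`."""
    return query in ancestors(normalize(section_type))


print(matches("Breeding", "biology_ecology"))
```

With a structure like this, a search for `biology_ecology` would also return sections tagged `breeding` or `activity`, while a search for `breeding` stays specific.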