punkish / zenodeo

`nodejs` interface to Zenodo/BLR community
https://zenodeo.org
Creative Commons Zero v1.0 Universal

which treatments dump should I use? #22

Open punkish opened 5 years ago

punkish commented 5 years ago

On Sep 12 (while at CERN) I downloaded a complete dump from http://tb.plazi.org/GgServer/dumps/plazi.xml.zip. I did a spot check, and that dump still had the problem described in "missing refString in xml 034919F42C71960265A9326F2957EB1D" (#21). That is, the BibRefCitation tags occur several times but have no refString. Additionally, the tags do not have a unique GUID for each treatment part, as described in "A proposal to improve data integrity" (#14).

Then I realized that perhaps I was looking at the wrong dump. A comment in issue 14 mentions the following dump: http://tb.plazi.org/GgServer/dumps/plazi.zenodeo.zip. So I downloaded that earlier today, but that too doesn't have the GUIDs.

Thinking that perhaps I was off by some days, I downloaded the plazi.xml.zip dump again just now, but that too doesn't have the GUID for each treatment part. Additionally, the problem from "missing refString in xml 034919F42C71960265A9326F2957EB1D" (#21) is still there.

But I know that the GUIDs do exist, as mentioned in this comment on issue 16 and visible at http://treatment.plazi.org/GgServer/zenodeo/038787DAFFF7FF904BBFF925FD13F9AA.

So, my question is: which dump should I download going forward? Actually, it is a two-part question. For starters, I want to download a complete dump.

But after that, I want to query for only the changes since the last download. Is this http://tb.plazi.org/GgServer/search?&indexName=0&resultFormat=XML&lastModifiedSince= the right URL for that?
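For illustration, here is a minimal Node.js sketch of how that incremental query URL could be assembled. It assumes the URL above is the right endpoint, which is exactly what this issue is asking. The function name `buildChangesUrl` is hypothetical, and the value format that `lastModifiedSince` expects is not stated anywhere in this thread, so the sketch passes the value through unchanged.

```javascript
// Hypothetical helper: builds the "changes since last download" URL.
// The base URL and the fixed parameters are copied from the URL quoted above.
function buildChangesUrl(since) {
  const url = new URL('http://tb.plazi.org/GgServer/search');
  url.searchParams.set('indexName', '0');
  url.searchParams.set('resultFormat', 'XML');
  // ASSUMPTION: the expected format of lastModifiedSince is unknown here,
  // so whatever the caller supplies is used as-is (URL-encoded if needed).
  url.searchParams.set('lastModifiedSince', String(since));
  return url.toString();
}

// 'PLACEHOLDER' stands in for the real value, whose format is still to be confirmed.
console.log(buildChangesUrl('PLACEHOLDER'));
```

Storing the value used for each successful run would let the next run pick up from where the previous one stopped.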

Also, what should I do about the missing refString in bibRefCitations? Should I look for other ways to extract the complete reference?

cc @gsautter @myrmoteras

gsautter commented 5 years ago

http://tb.plazi.org/GgServer/dumps/plazi.zenodeo.zip is indeed the one to use for you, created especially to cater to your needs, past and future ... Historization attributes are to come ... still awaiting your feedback regarding those. The refString attribute should be there in treatments extracted from IMFs, but might be missing on treatments extracted from XML articles imported from Pensoft. We add them now, but retro-adding them will only happen when we (soon) re-import the whole TaxPub collection.

punkish commented 5 years ago

OK, thanks. I downloaded plazi.zenodeo.zip and did some random checks. It doesn't have the id attribute in any of the treatment parts. For example, here is a snippet from 76B5C28DF61F5377A90B6A51215CBD41.xml

<subSubSection pageId="88" pageNumber="89" type="nomenclature">
<paragraph pageId="88" pageNumber="89">
<taxonomicName LSID="76b5c28d-f61f-5377-a90b-6a51215cbd41" authority="(S. T. Blake) P. I. Forst. &amp; T. C. Wilson" family="Lamiaceae" genus="Coleus" higherTaxonomySource="treatment-meta" kingdom="Plantae" lsidName="Coleus suaveolens" order="Lamiales" pageId="88" pageNumber="89" rank="species" species="suaveolens">
<pageBreakToken pageId="88" pageNumber="89" start="start">Coleus</pageBreakToken>
suaveolens (S.T.Blake) P.I.Forst. &amp; T.C.Wilson
</taxonomicName>
<taxonomicNameLabel pageId="88" pageNumber="89">comb. nov.</taxonomicNameLabel>
</paragraph>
</subSubSection>

By the way, http://treatment.plazi.org/GgServer/zenodeo/038787DAFFF7FF904BBFF925FD13F9AA looks great. I will have to run more tests on files like this, but it seems to be the format I could use. Let me know when I can download a data dump with such attributes.

I am hoping to write a report-generating program that can print out the treatmentIds that have certain errors, such as missing attributes or missing parts. I can do that while parsing and importing the data. Let me know if something like that would be useful to you.
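A rough sketch of what one such check could look like in Node.js, limited to the missing refString case discussed above. The function names, the input shape (a map from treatmentId to XML string), and the sample data are all made up for illustration. It scans opening tags with a regular expression instead of a real XML parser, so it assumes attribute values never contain a literal `>`.

```javascript
// Hypothetical helper: counts bibRefCitation opening tags in `xml`
// that carry no refString attribute. Regex-based, not a real XML parse.
function countMissingRefStrings(xml) {
  const openingTags = xml.match(/<bibRefCitation\b[^>]*>/gi) || [];
  return openingTags.filter((tag) => !/\srefString\s*=/.test(tag)).length;
}

// Builds a report from a { treatmentId: xmlString } map, listing only
// the treatments that have at least one citation without refString.
function buildReport(treatments) {
  const report = [];
  for (const [treatmentId, xml] of Object.entries(treatments)) {
    const missingRefStrings = countMissingRefStrings(xml);
    if (missingRefStrings > 0) {
      report.push({ treatmentId, missingRefStrings });
    }
  }
  return report;
}

// Invented sample data, not taken from any real treatment.
const sample = {
  'TREATMENT-A':
    '<paragraph><bibRefCitation refString="Smith 2001">Smith 2001</bibRefCitation></paragraph>',
  'TREATMENT-B':
    '<paragraph><bibRefCitation>Doe 1999</bibRefCitation></paragraph>',
};

console.log(buildReport(sample));
```

The same pattern could be extended with further checks, for example a missing id attribute on treatment parts, by adding one counting function per error type.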

cc @gsautter @myrmoteras

gsautter commented 5 years ago

Try http://tb.plazi.org/GgServer/dumps/plazi.xmlHistory.zip ... it has the historized XML for the whole treatment collection. See also my last comment in #16