jhpoelen / msw-plazi

Mammal Species of the World treatments transcribed by Plazi
Creative Commons Zero v1.0 Universal

use treatment bank source data instead of https://github.com/plazi/treatments-xml #4

Open jhpoelen opened 1 year ago

jhpoelen commented 1 year ago

as discussed in 2023-01-11 meeting -

instead of using https://github.com/plazi/treatments-xml as a source of truth, use xml files straight from the treatmentbank database:

e.g.,

use https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F

instead of:

https://github.com/plazi/treatments-xml/blob/main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml
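For reference, a minimal sketch of how the two locations could be derived from a treatment UUID. The function names are made up for illustration, and the assumption that the GitHub layout always nests files under the first three pairs of hex characters is inferred from this single example only.

```shell
# Hypothetical helpers, not part of any Plazi tooling.

# URL of the treatment XML as served by GgServer (double slash kept as in the example above).
ggserver_url() {
  printf 'https://tb.plazi.org/GgServer/xml//%s\n' "$1"
}

# Raw GitHub URL of the same treatment.
# Assumption: directories are the 1st, 2nd and 3rd pair of hex characters of the UUID.
github_raw_url() {
  p1=$(printf '%s' "$1" | cut -c1-2)
  p2=$(printf '%s' "$1" | cut -c3-4)
  p3=$(printf '%s' "$1" | cut -c5-6)
  printf 'https://raw.githubusercontent.com/plazi/treatments-xml/main/data/%s/%s/%s/%s.xml\n' \
    "$p1" "$p2" "$p3" "$1"
}

ggserver_url C30587A9A562FF93FF2F350F875ED55F
github_raw_url C30587A9A562FF93FF2F350F875ED55F
```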

jhpoelen commented 1 year ago

@flsimoes sorry to bother you again, but do you happen to know how I can download, most efficiently, all treatment xmls from treatment bank? I don't see any data download options, and I am sure that I am missing something.

jhpoelen commented 1 year ago

https://tb.plazi.org/dumps/ ?

flsimoes commented 1 year ago

@jhpoelen I'm afraid I can't help you much with this, as I've not had the chance or need to do said download. I guess the dumps link you sent could be it. Otherwise you need to ask Guido or Donat.

jhpoelen commented 1 year ago

It appears that only the "historized" data dumps contain the needed taxonomicName UUIDs.

However, the historized data dumps differ slightly from the XML published via the GgServer web service.

The attached zip, C30587A9A562FF93FF2F350F875ED55F.zip, contains:

$ unzip -l ~/tmp/C30587A9A562FF93FF2F350F875ED55F.zip 
Archive:  /home/jorrit/tmp/C30587A9A562FF93FF2F350F875ED55F.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     7342  2023-01-11 12:22   C30587A9A562FF93FF2F350F875ED55F.dump.xml
     7342  2023-01-11 12:24   C30587A9A562FF93FF2F350F875ED55F.github.xml
    20645  2023-01-11 12:27   C30587A9A562FF93FF2F350F875ED55F.historized.dump.xml
     9734  2023-01-11 12:19   C30587A9A562FF93FF2F350F875ED55F.web.xml
---------                     -------
    45063                     4 files

where

C30587A9A562FF93FF2F350F875ED55F.dump.xml extracted from https://tb.plazi.org/dumps/plazi.xml.daily.zip

C30587A9A562FF93FF2F350F875ED55F.github.xml downloaded from https://raw.githubusercontent.com/plazi/treatments-xml/main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml

C30587A9A562FF93FF2F350F875ED55F.historized.dump.xml extracted from https://tb.plazi.org/dumps/plazi.xmlHistory.daily.zip

and

C30587A9A562FF93FF2F350F875ED55F.web.xml downloaded from https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F

It appears that the first two (plazi.xml.daily.zip dump and GitHub) are identical and do not contain the needed taxonomicName ids.

The last two (plazi.xmlHistory.daily.zip dump and GgServer) differ from each other, but both contain the needed taxonomicName ids.
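A rough way to reproduce this comparison on the four extracted files is sketched below. The helper names are hypothetical, and the check assumes the identifier sits in an attribute literally named `id` on the `taxonomicName` start tag; adjust the pattern if the attribute is named differently.

```shell
# Succeeds (exit status 0) when two files are byte-for-byte identical.
same_content() {
  cmp -s "$1" "$2"
}

# Counts <taxonomicName> start tags that carry an id attribute.
# Assumption: the attribute is named exactly "id".
count_named_ids() {
  grep -E -o '<taxonomicName( [^>]*)? id="' "$1" | wc -l | tr -d ' '
}

# Usage, run from the directory holding the unzipped files:
# id=C30587A9A562FF93FF2F350F875ED55F
# same_content "$id.dump.xml" "$id.github.xml" && echo "dump and github are identical"
# for f in "$id".*.xml; do echo "$f: $(count_named_ids "$f")"; done
```

A count of 0 for a file means no `taxonomicName` element in it carries an `id` attribute under that assumption.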

@myrmoteras @gsutter can you please explain why the last two are different, and which one presents a more accurate state of the treatment bank records? Which one do you recommend for tracking the content of plazi treatment data?

jhpoelen commented 1 year ago

here's a screenshot of a diff between the xml "history" dump and the xml produced by ggserver -

image

jhpoelen commented 1 year ago

note that it takes about 0.5s to retrieve a single treatment xml via

$ time curl https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9734    0  9734    0     0  17504      0 --:--:-- --:--:-- --:--:-- 17507

real    0m0.563s
user    0m0.047s
sys     0m0.009s

so, it'd take me about 0.5 * 826880 (number of treatments according to https://plazi.org/treatmentbank/) = 413440s, or roughly 115 hours (almost 5 days), to retrieve all of the treatments sequentially, excluding the time it takes to retrieve a full list of all available treatment uuids.
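The back-of-the-envelope estimate can be checked with a one-liner. This is only a sketch: `estimate_hours` is a made-up helper, and it assumes strictly sequential requests at a constant time per request.

```shell
# Hours needed to fetch N treatments one by one at S seconds each.
estimate_hours() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 3600 }'
}

estimate_hours 826880 0.5   # prints 114.8
```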

Retrieving the data dump, however, takes less than a minute, depending on the internet connection.
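A minimal sketch of the dump-based route, for comparison. The helper names are hypothetical, and it assumes that entry names inside the archive contain the treatment UUID; the quoted wildcard avoids guessing the archive's internal directory layout. If several entries match (for example multiple versions in the history dump), their contents are printed one after another.

```shell
# Download a dump archive once; -f makes curl fail on HTTP errors, -L follows redirects.
fetch_dump() {
  curl -fL -o "$2" "$1"
}

# Write the archive entries whose name contains the given UUID to stdout.
# Assumption: entry names contain the treatment UUID.
extract_treatment() {
  unzip -p "$1" "*$2*"
}

# Usage (needs network access, so left commented out):
# fetch_dump https://tb.plazi.org/dumps/plazi.xmlHistory.daily.zip history.zip
# extract_treatment history.zip C30587A9A562FF93FF2F350F875ED55F > treatment.xml
```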