Open jhpoelen opened 1 year ago
@flsimoes sorry to bother you again, but do you happen to know the most efficient way to download all treatment xmls from treatment bank? I don't see any data download options, and I am sure that I am missing something.
@jhpoelen I'm afraid I can't help you much with this, as I haven't had the chance or need to do that download. I guess the dumps link you sent could be it. Otherwise you need to ask Guido or Donat.
It appears that only the "historized" data dumps contain the needed taxonomicName uuids.
However, the historized data dumps differ slightly from the treatments published via the ggserver webservice.
In the attached zip
C30587A9A562FF93FF2F350F875ED55F.zip
you find
$ unzip -l ~/tmp/C30587A9A562FF93FF2F350F875ED55F.zip
Archive:  /home/jorrit/tmp/C30587A9A562FF93FF2F350F875ED55F.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     7342  2023-01-11 12:22   C30587A9A562FF93FF2F350F875ED55F.dump.xml
     7342  2023-01-11 12:24   C30587A9A562FF93FF2F350F875ED55F.github.xml
    20645  2023-01-11 12:27   C30587A9A562FF93FF2F350F875ED55F.historized.dump.xml
     9734  2023-01-11 12:19   C30587A9A562FF93FF2F350F875ED55F.web.xml
---------                     -------
    45063                     4 files
where
C30587A9A562FF93FF2F350F875ED55F.dump.xml extracted from https://tb.plazi.org/dumps/plazi.xml.daily.zip
C30587A9A562FF93FF2F350F875ED55F.github.xml downloaded from https://raw.githubusercontent.com/plazi/treatments-xml/main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml
C30587A9A562FF93FF2F350F875ED55F.historized.dump.xml extracted from https://tb.plazi.org/dumps/plazi.xmlHistory.daily.zip
and
C30587A9A562FF93FF2F350F875ED55F.web.xml downloaded from https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F
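For anyone repeating the extraction step, here is a minimal Python sketch of pulling one treatment out of a downloaded dump. The function name `read_treatment_from_dump` is hypothetical, and how members are named inside the Plazi dump zips is an assumption, so the lookup matches on the treatment uuid as a substring rather than on an exact path.

```python
import zipfile


def read_treatment_from_dump(dump, treatment_id):
    """Return the bytes of the first zip member whose name contains treatment_id.

    `dump` may be a file path or a binary file-like object.
    Member naming inside the dump is an assumption, hence the substring match.
    """
    with zipfile.ZipFile(dump) as archive:
        for name in archive.namelist():
            if treatment_id in name:
                return archive.read(name)
    raise KeyError(f"no member matching {treatment_id!r} in archive")
```

Usage would look like `read_treatment_from_dump("plazi.xml.daily.zip", "C30587A9A562FF93FF2F350F875ED55F")` after downloading the dump to the current directory.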
It appears that the first two (github and the plazi.xml.daily.zip dump) are identical and do not contain the needed taxonomicName ids,
whereas the last two (the plazi.xmlHistory.daily.zip dump and the ggserver web xml) differ from each other, but do contain the needed taxonomicName ids.
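The comparison above can be reproduced with a small Python sketch that fingerprints each file and counts name elements carrying an id. The function name `summarize` is hypothetical, and it assumes the ids in question appear as an `id` attribute on `taxonomicName` elements; adjust the attribute name if the dumps use something else.

```python
import hashlib
import xml.etree.ElementTree as ET


def summarize(xml_bytes):
    """Return a content hash plus counts of taxonomicName elements with and without an id."""
    root = ET.fromstring(xml_bytes)
    # Strip any namespace prefix so "{ns}taxonomicName" also matches.
    names = [el for el in root.iter() if el.tag.split("}")[-1] == "taxonomicName"]
    return {
        "sha256": hashlib.sha256(xml_bytes).hexdigest(),
        "taxonomicNames": len(names),
        # Assumption: the needed ids live in an "id" attribute.
        "with_id": sum(1 for el in names if el.get("id")),
    }
```

Two files are byte-identical when their `sha256` values match, and a `with_id` of 0 indicates a variant without the taxonomicName ids.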
@myrmoteras @gsutter can you please explain why the last two are different, and which one presents a more accurate state of the treatment bank records? Which one do you recommend using for tracking the content of plazi treatment data?
here's a screenshot of a diff between the xml "history" dump and the xml produced by ggserver -
note that it takes about 0.5s to retrieve a single treatment xml via
$ time curl https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9734    0  9734    0     0  17504      0 --:--:-- --:--:-- --:--:-- 17507
real 0m0.563s
user 0m0.047s
sys 0m0.009s
so, it'd take me about 0.5 s * 826880 (number of treatments according to https://plazi.org/treatmentbank/) = 413440 s, or about 115 hours (nearly 5 days), to retrieve all of the treatments one at a time, excluding the time it takes to retrieve a full list of all available treatment uuids.
Retrieving the data dump, however, takes less than a minute, depending on the internet connection.
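The estimate can be checked with a few lines of Python. This is only the arithmetic from the figures quoted above (0.5 s per request, 826880 treatments), assuming requests are made one after another; the function name is hypothetical.

```python
def sequential_download_hours(seconds_per_request, request_count):
    """Hours needed to fetch request_count items one at a time."""
    return seconds_per_request * request_count / 3600


total_seconds = 0.5 * 826880
hours = sequential_download_hours(0.5, 826880)
print(f"{total_seconds:.0f} s = {hours:.1f} h = {hours / 24:.1f} days")
# prints: 413440 s = 114.8 h = 4.8 days
```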
as discussed in the 2023-01-11 meeting:
instead of using https://github.com/plazi/treatments-xml as the source of truth, use xml files straight from the treatmentbank database:
e.g.,
use https://tb.plazi.org/GgServer/xml//C30587A9A562FF93FF2F350F875ED55F
instead of:
https://github.com/plazi/treatments-xml/blob/main/data/C3/05/87/C30587A9A562FF93FF2F350F875ED55F.xml
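To make the switch concrete, here is a small Python sketch that builds both locations from a treatment uuid. The function names are hypothetical. The GgServer prefix is copied verbatim from this thread (including the double slash), and the github path layout (three two-character directories taken from the start of the uuid) is inferred from the single example above, so treat it as an assumption.

```python
# Prefix copied verbatim from the thread, double slash included.
GGSERVER_XML_PREFIX = "https://tb.plazi.org/GgServer/xml//"
GITHUB_RAW_PREFIX = "https://raw.githubusercontent.com/plazi/treatments-xml/main/data/"


def ggserver_url(treatment_id):
    """URL of the treatment xml served straight from the treatmentbank database."""
    return GGSERVER_XML_PREFIX + treatment_id


def github_raw_url(treatment_id):
    """URL of the same treatment in the github mirror.

    Assumption: files are sharded by the first three pairs of characters of the uuid.
    """
    shards = "/".join(treatment_id[i:i + 2] for i in (0, 2, 4))
    return f"{GITHUB_RAW_PREFIX}{shards}/{treatment_id}.xml"
```

For `C30587A9A562FF93FF2F350F875ED55F` these reproduce the two URLs listed earlier for the `.web.xml` and `.github.xml` files.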