plazi / treatmentBank

Repository devoted to housekeeping of treatmentBank

reconciling xml numbers #21

Open punkish opened 2 years ago

punkish commented 2 years ago

@gsautter

Perhaps you can help me clear up my doubts… Some time back I downloaded all the XML (history) zip archives (full, monthly, weekly, and daily), and yesterday I successfully parsed and loaded them. The archives had the following number of files in them:

full: 611737 (~4.25GB)
monthly: 124689 (~1GB)
weekly: 5919 (~45MB)
daily: 193 (2MB)
——————
total files: 742538

I loaded them all into the db and got 633297 treatments, which matches the final number of files in my archive. (Note: when I archive the files, I rearrange them in a hierarchical order, and duplicates are removed because when a file from the zip dump is copied to the archive, any previous version is overwritten.) My question to you: do those numbers sound right? That discrepancy of almost 110K files (742538 - 633297 = 109241), is that because of new, updated versions of the same treatments?
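For illustration only, the overwrite-on-copy bookkeeping described above could be sketched like this. It is a minimal sketch, not the actual loader: the `reconcile` function, the archive labels, and the file names are all hypothetical, and it assumes a treatment is identified by its file name and that archives are processed oldest first.

```python
from collections import Counter


def reconcile(archives):
    """Count files across dumps and how many survive overwrite-on-copy.

    archives: list of (label, iterable of file names), oldest dump first.
    Assumption: two files with the same name are versions of one treatment.
    """
    latest = {}  # file name -> label of the dump that last supplied it
    total = 0
    for label, names in archives:
        for name in names:
            total += 1
            latest[name] = label  # a later copy replaces the earlier one
    unique = len(latest)
    return {
        "total": total,
        "unique": unique,
        "overwritten": total - unique,
        "kept_from": dict(Counter(latest.values())),
    }


# Toy data, not the real dumps: the file names are made-up treatment IDs.
archives = [
    ("full", ["A.xml", "B.xml", "C.xml"]),
    ("monthly", ["B.xml", "D.xml"]),
    ("weekly", ["B.xml"]),
]
result = reconcile(archives)
print(result)
```

If the same logic were run over the real archives, `total` should come out as 742538 and `unique` as 633297, so `overwritten` would be 109241 if every extra file is a newer version of an existing treatment.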

Fwiw, the numbers for the other elements were as follows:

materialsCitations: 1125969
treatmentAuthors: 2609993
treatmentCitations: 621661
figureCitations: 1842858
bibrefCitations: 2776550
collectionCodes: 17036

gsautter commented 2 years ago

> That discrepancy of almost 110K files (742538 - 633297), is that because of new, updated versions of the same treatments?

Most likely yes ... and given our expanding export and linking activities (Zenodo, GBIF dataset keys, GBIF record keys), treatments can go through a good few versions before reaching their final state.

The first Sunday of January will produce a new full dump.