plazi / arcadia-project


data workflows TB - Zenodo/BLR - users (GBIF, Ocellus, etc.) #61

Open myrmoteras opened 5 years ago

myrmoteras commented 5 years ago

Following yesterday's discussion, I would like to clarify the envisioned flow of data from TB to the users of the data.

  1. Plazi/TreatmentBank is liberating data (treatments and the data therein, such as MC and figures).
  2. Plazi deposits figures and treatments on Zenodo, including, for the latter, specific file formats for reuse: DwC-A, TaxPub (Terry to design, and then the default MIME type), and a simplified GG XML version following the needs of Puneet's Zenodeo/Ocellus.
  3. Zenodo/BLR is the primary source for users such as GBIF, BLR website.

Right now, it seems that the data users (point 3) rely on Plazi TreatmentBank. Right? What are the plans to change to Zenodo/BLR? The change can clearly only occur once we have the treatments uploaded and a mechanism in place to keep them in sync with processing on the TB side. But we have to plan for it.
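To make the "keep them in sync" part concrete, here is a minimal sketch of how a sync plan could be computed by comparing checksums. It is only an illustration: the function names, the file names, and the idea of recording a SHA-256 checksum per deposited file are assumptions, not an existing TB or Zenodo mechanism.

```python
from __future__ import annotations

import hashlib


def checksum(data: bytes) -> str:
    """Return a hex SHA-256 digest used to compare file contents."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(tb_files: dict[str, bytes], deposited: dict[str, str]) -> dict[str, list[str]]:
    """Decide what has to happen to bring a deposit in line with TB.

    tb_files:  file name -> current content exported by TB (hypothetical input)
    deposited: file name -> checksum of the copy on Zenodo (hypothetical input)
    """
    plan: dict[str, list[str]] = {"upload": [], "replace": [], "unchanged": []}
    for name, data in sorted(tb_files.items()):
        if name not in deposited:
            plan["upload"].append(name)       # not on Zenodo yet
        elif deposited[name] != checksum(data):
            plan["replace"].append(name)      # TB copy changed since the last deposit
        else:
            plan["unchanged"].append(name)    # already in sync
    return plan
```

A nightly job could run something like this per treatment and only touch the deposit when `upload` or `replace` is non-empty.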

What are the plans to make these adjustments? Is this realistic before BiodivNext 2019 on October 25?

The custom solution for Zenodo/BLR, which allows attaching several files to a deposit and updating them without changing the DOI, was built on the premise that the treatments on Zenodo will become the primary data source, not TB. TB will continue to provide customized data.
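As a rough illustration of that premise (several files per deposit, updates without a new DOI), the deposit can be thought of as the small model below. This is a sketch only: the class, its fields, the file names, and the placeholder DOI are all hypothetical and do not describe the actual Zenodo/BLR implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TreatmentDeposit:
    """Hypothetical model of one treatment deposit with several attached files."""

    doi: str                                   # assigned once; file updates never change it
    files: dict = field(default_factory=dict)  # file name -> content
    revision: int = 0                          # hypothetical counter of file changes

    def put_file(self, name: str, content: bytes) -> None:
        """Attach a new file or replace an existing one under the same DOI."""
        if self.files.get(name) != content:
            self.files[name] = content
            self.revision += 1                 # identical re-uploads are ignored


# Usage: the DOI below is a placeholder, not a real record.
deposit = TreatmentDeposit(doi="10.5281/zenodo.EXAMPLE")
deposit.put_file("treatment.dwca.zip", b"v1")
deposit.put_file("treatment.taxpub.xml", b"<article/>")
deposit.put_file("treatment.dwca.zip", b"v2")  # update after reprocessing on TB
```

After these calls the deposit still carries the same DOI, with two files and three recorded changes.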

This is also to remove routine and custom tasks from Plazi/TB.

Let me know what your thoughts are, especially @gsautter @punkish from an operational point of view.

gsautter commented 5 years ago

As of now, GBIF is consuming DwC-As, and we have never even mentioned putting those on Zenodo ... and there are quite a few reasons to keep serving them from TB:

gsautter commented 5 years ago

That said, as long as we use the current DwC-A mechanism to get data to GBIF, I see little to no use in making Zenodo the primary source, but a lot of effort on the other hand.

Things might change once GBIF introduces treatments on their end, even though at this point I don't know what exactly this type of data object will look like at the API level, let alone with what mechanism we will be transferring the treatment object data to GBIF.

Bottom line is that right now we should keep the DwC-As in place, and on TB. Once we know more about the technical aspects of the upcoming GBIF treatment object, things might change dramatically, but not before that.

myrmoteras commented 5 years ago

We discussed this yesterday at the Arcadia Skype call https://github.com/plazi/arcadia-project/issues/60#issuecomment-509622576

and agreed that the DwC-A is one of the files we are adding to the deposit (alongside TaxPub and the "Puneet" GG XML version).

gsautter commented 5 years ago

Adding the DwC-A to the article depositions makes perfect sense (and is not all too complicated, either), just declaring said supplementary deposition file the primary copy does not, for several reasons:

punkish commented 5 years ago

Everyone, perhaps the following makes sense:

       .─────────────.                                
     (   ingestion   )                               
      `──────┬──────'                                
             │                                       
             │                                       
             │                                       
   ┌─────────▼─────────┐                             
   │  Treatment Bank   │        ┌───────────────────┐
┌──│    (customized    ├────┬───▶       GBIF        │
│  │     products)     │    │   └───────────────────┘
│  └─────────┬─────────┘    │   ┌───────────────────┐
│      standardized         ├───▶    hackathons     │
│       TaxPub XML          │   └───────────────────┘
│            │              │   ┌───────────────────┐
│  ┌─────────▼─────────┐    └───▶       other       │
│  │      Zenodo       │        └───────────────────┘
│  │   (permanence,    │                             
│  │version of record, │───┐                         
│  │   images, DOI)    │   │                         
│  └───────────────────┘   │                         
└────────────┐             │                         
             │             │                         
             │             │                         
      simplified XML    images                       
       with data in        │                         
        attributes         │                         
             │             │                         
             │             │                         
   ┌─────────▼─────────┐   │    ┌───────────────────┐
   │      Zenodeo      │   ├────▶      ocellus      │
   │       (API)       ├───┴┐   └───────────────────┘
   └───────────────────┘    │                        
                            │   ┌───────────────────┐
                            └───▶other applications │
                                └───────────────────┘

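For anyone who prefers to reason about the diagram programmatically, here is one possible reading of it as a directed graph. The edge list is an interpretation of the ASCII art above (an assumption, not an agreed specification), and the helper name is hypothetical.

```python
# One reading of the diagram: node -> nodes it feeds directly.
FLOW = {
    "ingestion": ["TreatmentBank"],
    "TreatmentBank": ["GBIF", "hackathons", "other", "Zenodo", "Zenodeo"],
    "Zenodo": ["ocellus", "other applications"],   # images
    "Zenodeo": ["ocellus", "other applications"],  # API
}


def downstream(node, flow=FLOW):
    """Return every node reachable from `node`, in breadth-first order."""
    seen = []
    queue = list(flow.get(node, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(flow.get(current, []))
    return seen


print(downstream("Zenodo"))
```

In this reading GBIF is reachable from TreatmentBank but not from Zenodo, which is exactly the point under discussion.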
gsautter commented 5 years ago

@punkish thanks for that, that about sums up what I have in mind.

... nice ASCII art, by the way ;-)

myrmoteras commented 5 years ago

Here is the alternative version we discussed, with Zenodo as the primary source for data, especially for production-level services such as the delivery of DwC-As to GBIF.

[image] And here is a link for comments:

https://www.draw.io/#G173rO0wO5BS6ZOGTJecw5Pm9i1G_LFTIV

myrmoteras commented 5 years ago

This represents three principles of the Arcadia project that we discussed:

  1. The BLR website follows the concept of Zenodo, whereby Zenodo delivers basic, robust functionality and domain-specific demands are handled by a service based on Zenodo.
  2. Zenodo is a primary data source (not just an archive).
  3. BLR aims to inspire the community to create FAIR data from publications by showcasing, through the website/API, what is possible.

myrmoteras commented 5 years ago

> How to go about XML-only articles? We have no Zenodo depositions for these articles at all to attach the DwC-A to, so at least those would have to continue to come from TB, or we need to create dummy depositions just so we have a place for GBIF to get the DwC-A from ... a lot of hassle either way, and again potentially of short-lived value due to the upcoming GBIF treatment object that might require a whole new transfer mechanism.

I do not understand this. ZooKeys is XML-only and we serve DwC-As for it. It has treatments, so we produce a treatment deposit on Zenodo too.

myrmoteras commented 5 years ago

> What I explained above (additional delays for publishing depositions, especially on updates, an additional point of failure added in the middle, incurring added complexity in notifying GBIF of updates, unclear future of GBIF data transfer in light of upcoming treatment object).

If anything is unclear about the GBIF treatment object, then this holds true for the current upload too. Furthermore, if you expect any issues, then use the GBIF GitHub issue tracker or contact Tim.

Our future at Plazi is that we do not maintain an ever-increasing service infrastructure but make as much use as possible of large-scale infrastructures like Zenodo to deliver services. A further goal is to get independent feeds of treatments into BLR.

Rather the opposite, with Zenodo as the primary source of DWCA for GBIF, we have some redundancy in the system.
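To illustrate what I mean by redundancy, a consumer could try Zenodo first and fall back to TB if Zenodo cannot deliver. This is a minimal sketch under assumptions: the fetcher callables and source names are hypothetical stand-ins, and no real Zenodo, TB, or GBIF API is called.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes a treatment identifier and returns the archive bytes,
# or None if that source does not hold the archive (hypothetical interface).
Fetcher = Callable[[str], Optional[bytes]]


def fetch_dwca(treatment_id: str, sources: List[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each source in order and return (source name, archive bytes)."""
    for name, fetch in sources:
        try:
            data = fetch(treatment_id)
        except OSError:
            data = None  # source unreachable; move on to the next one
        if data is not None:
            return name, data
    raise LookupError(f"no source could deliver the DwC-A for {treatment_id}")
```

With Zenodo listed first and TB second, an outage on either side does not interrupt delivery as long as the other copy is reachable.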