plazi / arcadia-project


frankfurt #140

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

@gsautter and @mguidoti, can we take a moment on Friday morning at POA to set up production for at least the well-established journals in Frankfurt? That means we need a clear understanding of exactly what we want to do.

  1. Upload files to Frankfurt
  2. Trigger the processing
  3. Upload files to TB with a certain level of control
  4. Provide access to, and alerts for, files that lack treatments or metadata
  5. Have easy access to view and curate the files

To begin with, I could then upload all the Zootaxa files. For EJT we need to discuss the best approach: process and make the articles accessible immediately, but with QC within a day or two to meet the standards agreed with EJT. For other journals with templates: TBD.

mguidoti commented 4 years ago

Hi Donat,

As I said before, I honestly don't think it's possible to implement this entire system the way you want it by Friday, because we have other things to do that require Guido's presence here, and this doesn't.

Also as I said before, the technical specification on how this would work is described here.

From all of your comments so far, I've come to realize that there might be a fundamental misunderstanding among all parties regarding this automation process on the Frankfurt server.

First, we can't have a fully automated process (from downloading PDFs to extracting AND uploading treatments), because the treatments will have Zenodo DOIs registered and the process isn't flawless enough for us to trust it blindly (e.g. nested treatments, duplicates, etc.).

Thus, we can automate:

  1. Web scraping/automatic PDF download
  2. Internal triggers to run the batch process
  3. E-mail reporting so we can QC at BLOCKER level (POA Office) and upload
  4. Automated reporting to Sofia so they can QC at CRITICAL level
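As a rough illustration of items 3 and 4, the reporting step could group QC findings by level and hand each group to the team that reviews it. This is only a sketch under assumptions: the `Finding` record, the `REVIEWERS` mapping, and `build_reports` are hypothetical names, not part of the existing tooling, and actual e-mail delivery is left out.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical mapping of QC level to the team that reviews it.
REVIEWERS = {"BLOCKER": "POA Office", "CRITICAL": "Sofia"}


@dataclass(frozen=True)
class Finding:
    document: str  # e.g. the PDF file name
    level: str     # QC level, e.g. "BLOCKER" or "CRITICAL"
    message: str   # short description of the problem


def build_reports(findings):
    """Group QC findings into one list of report lines per reviewing team."""
    reports = defaultdict(list)
    for finding in findings:
        team = REVIEWERS.get(finding.level)
        if team is None:
            continue  # levels without an assigned reviewer are not reported here
        reports[team].append(
            f"{finding.document}: [{finding.level}] {finding.message}"
        )
    return dict(reports)
```

Each team's list could then be sent as the body of one report e-mail per batch run.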

Everything was specced out in this document. Web scraping is publisher-specific, and I haven't had time to get my hands on it yet. Johannes will build the bridge from my well-defined scraper output to the GGI on the server. Guido will do the rest. All of this is described here, but it will take longer than until Friday to implement, especially since we still have all the QC-related things to discuss and some tickets on this Kanban to go through.

For Zootaxa, if you prefer, we can set up a Google Drive folder so you can upload the files there, since you have access for now, and we at PoA will process them.

myrmoteras commented 4 years ago

You are talking about the entire project, from web scraping to all the rest. There are many steps involved, but not all of them need to be in place at the same time.

I am talking about uploading a file (I can do it from here, or at least should be able to), then getting it processed and disseminated, i.e. uploaded to TB. These are the steps I do every morning here, and they are what I am talking about.

Web scraping and other steps can be added later.

If you don't have time, then I will do this step once Guido is at home.

There is another rationale behind this move: we cannot invest in processing infrastructure and then not use it. So we need to gain some experience and run tests before the next Arcadia report in May.

mguidoti commented 4 years ago

Ok, you're right.

So, I talked to Guido here, and he still needs to implement two things on his side of the equation: 1) filtering out bad treatments/documents by BLOCKERS to avoid sending bad ones to Zenodo; 2) reporting to someone (PoA Office) when this happens.
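A minimal sketch of what such a filter might look like, assuming each processed document comes with a list of QC levels raised against it. The input shape and the function name `split_by_blockers` are assumptions for illustration, not Guido's actual implementation.

```python
def split_by_blockers(documents):
    """Split documents into those safe to upload and those held back.

    `documents` maps a document name to the list of QC levels raised
    for it (hypothetical input shape). Any document with at least one
    "BLOCKER" is held back so it is not sent to Zenodo.
    """
    uploadable, held_back = [], []
    for name, levels in documents.items():
        if "BLOCKER" in levels:
            held_back.append(name)   # would be reported to the PoA Office
        else:
            uploadable.append(name)  # may proceed to upload
    return uploadable, held_back
```

The `held_back` list is what the report to the PoA Office would be built from.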

He said that he needs up to a week to have his part set up.

If you define this as top priority, we can most certainly have it in place (web scraping too) before the next report.

myrmoteras commented 1 year ago

@gsautter @Jo-Jo0 can we please resume planning the automated import of new articles from those journals for which we have templates?

A goal could be to have this in place by the end of 2022, so that we have a continuous flux of new articles coming into the system from the start of 2023?!

gsautter commented 1 year ago

Sure ... this has always been a goal, and I think we now can put some focus on it.