plazi / arcadia-project


annual report 2020. due May 14 #144

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

The annual report for Arcadia is due May 14, 2020

punkish commented 4 years ago

Note: Below are the year two activities as listed in the original project proposal. Since this list was created almost two years ago, it has its limitations, understandably. Please feel free to change/add activities (only x.y. level and below) as required. Please do not change the five top level activities.

The report has to be:

  • concise, down to the bullet points
  • a report against this list of activities, plus a forecast
  • ca. a 6-page document

  1. Liberation of data from scholarly publications
     1.1. Improve online data quality control and correction tool
       • add better feedback facilities to the TB search portal
       • finalize the TB search portal correction facilities
       • Define what we outdate
       • Install a feedback mechanism for data issues
     1.2. Set up ingestion of TaxPub-based articles
       • already in place for Pensoft; only needs to be generalized
     1.3. Extend template-based extraction to 70K additional treatments, including those from the most relevant taxonomy journals
  2. Infrastructure
     2.1. Migrate existing TreatmentBank data to Zenodo
       • This is dependent on 1.2.1
     2.2. Adjust the extraction process so new output goes to Zenodo
  3. Interfaces, Discovery tools and APIs (create sample applications to demonstrate the discovery and analytical capabilities of the API)
     3.1. Zenodeo API development
     3.2. Ocellus, a sample application to demonstrate API capabilities
     3.3. Synospecies
  4. BLR Web Presence
     4.1. Enhance the website with rich information discovery tools
     4.2. Brand and design the website
     4.3. Refine and test the UI
  5. Outreach
     5.1. Provide a daily summary of new data liberated on Twitter and Facebook
     5.2. Attend at least three major conferences, two in Europe and one in the United States or elsewhere
     5.3. Publish a scholarly publication describing our work
punkish commented 4 years ago

Please look at each of the major activity groups above (numbered 1-5). Under each of them, please read through each of the tasks listed (numbered x.y.). For whichever tasks apply to you, please fill out the following template. (Make a new comment in this thread, copy the template, and fill it out.)

Please do not exceed the prescribed word limits. The report has to be short, and editing long texts down to the required length becomes very, very tedious. Provide URLs where needed (URLs don't count in the word limits). Please provide up to two images, if applicable. I have already created a couple of filled templates as examples.

Note 1: Task 1.1. has a list of example sub-tasks (the lines starting with •). Modify them as needed, and create similar sub-tasks under other tasks, as required.

Note 2: @myrmoteras please assign the entire team to this issue so they may contribute toward the report.

x.y. <Task Name>

Summary (50 words or less)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed porta orci sed dignissim scelerisque. Ut congue lobortis mi vitae vehicula. Nullam iaculis et velit in facilisis. Vestibulum viverra volutpat convallis. Proin dolor enim, gravida convallis fringilla quis, eleifend id odio. Nullam ut est non turpis euismod ullamcorper. Proin dictum massa.

url: (optional) http://example.com

Progress (50 words or less)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus dolor felis, viverra sit amet felis a, aliquam molestie mi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut non porttitor mauris, eu auctor augue. Suspendisse potenti. Sed tincidunt, ipsum a maximus ultrices, lacus orci sagittis odio, in varius metus sapien.

url: (optional) http://example.com

image: (optional). Please provide high quality images. They will be cropped and resized eventually to max 1200px width. Of course, they can be less than 1200px.

punkish commented 4 years ago

3.1. Zenodeo API Development

Summary

Version 2 of the Zenodeo API was released, with query capabilities over the complete TreatmentBank dataset. The API was also made more performant and equipped with added capabilities for the future development of more analytical insights into the data and how it is used.

url: https://zenodeo.punkish.org/v2

Progress

The Zenodeo API received a major version upgrade that exposed all the treatment-related data in TreatmentBank to RESTful queries. The underlying data store and caching mechanism were tuned for high performance regardless of query complexity. The data store and the querying mechanism were also given self-tracking capabilities, so every query is recorded along with its count and performance. This lays the foundation for insightful analysis of how the data are used.

image: Zenodeo
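For reference, a minimal sketch of how a client might query the upgraded API from Python. The resource name (treatments) and the query parameters (q, size) are assumptions for illustration; the actual routes and parameters are documented at the URL above.

```python
# Hypothetical query against the Zenodeo v2 API; the "treatments" resource
# and the "q"/"size" parameters are assumptions, not confirmed names.
import requests

BASE = "https://zenodeo.punkish.org/v2"

def search_treatments(term, size=10):
    """Fetch up to `size` treatment records matching a free-text term."""
    resp = requests.get(f"{BASE}/treatments", params={"q": term, "size": size})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(search_treatments("Formicidae"))
```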

punkish commented 4 years ago

3.2. Ocellus, a sample application to demonstrate API capabilities

Summary

Ocellus, the image discovery application developed in year 1 of the Arcadia project, also received an upgrade: a more refined user interface, plus the groundwork for further development of more granular querying capabilities.

url: https://ocellus.punkish.org/

Progress

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus dolor felis, viverra sit amet felis a, aliquam molestie mi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut non porttitor mauris, eu auctor augue. Suspendisse potenti. Sed tincidunt, ipsum a maximus ultrices, lacus orci sagittis odio, in varius metus sapien.

image: Ocellus 1

Ocellus 2

myrmoteras commented 4 years ago

@tcatapano @slint @lnielsen @gsautter @advocomplex @mguidoti @teodorgeorgiev @lyubomirpenev

At the beginning of this issue you will find the description of how Puneet and I plan to write the report this time (the copy of the last, too long report is here to give you an idea of how it will look).

We would appreciate it if you could return your parts by Tuesday the 5th.

The leaders of the following parts are indicated:

  1. Liberation of data from scholarly publications (GS)
     1.1. Improve online data quality control and correction tool (GS/TG)
       • add better feedback facilities to the TB search portal
       • finalize the TB search portal correction facilities
       • Define what we outdate
       • Install a feedback mechanism for data issues
     1.2. Set up ingestion of TaxPub-based articles (TC)
       • already in place for Pensoft; only needs to be generalized (GS/TG)
     1.3. Extend template-based extraction to 70K additional treatments, including those from the most relevant taxonomy journals (MG)
  2. Infrastructure
     2.1. Migrate existing TreatmentBank data to Zenodo (AI/GS)
       • This is dependent on 1.2.1
     2.2. Adjust the extraction process so new output goes to Zenodo (GS)
  3. Interfaces, Discovery tools and APIs (PK)
     3.1. Create sample applications to demonstrate discovery and analytical capabilities of the API
  4. BLR Web Presence (TG, TC)
     4.1. Enhance the website with rich information discovery tools
     4.2. Brand and design the website
     4.3. Refine and test the UI
  5. Outreach (DA)
     5.1. Provide a daily summary of new data liberated on Twitter and Facebook
     5.2. Attend at least three major conferences, two in Europe and one in the United States or elsewhere
     5.3. Publish a scholarly publication describing our work
myrmoteras commented 4 years ago

@gsautter here is what we discussed and need:

mguidoti commented 4 years ago

Hi Donat,

Nothing on the 50% of n. sp./year goal? Lycophron, O3RT, training materials?

Thanks

myrmoteras commented 4 years ago

@mguidoti The stats should be a summary from May 16, 2019 to a date as close as possible to May 14, 2020. I would also like to see the figures from September 1, 2019 to May 1, 2020; we should also show monthly production, number of template-based treatments, etc.

We might want to discuss lower or higher numbers, and what changes we plan in order to get to 200,000 treatments in y3 (tables, etc.).

mguidoti commented 4 years ago

Ok,

Don't forget that we have data to show how the production rate has changed over time since the office started, with the different variables involved, like granularity. I think we can show a clear tendency of ever-growing productivity since we started, and that might be an important argument in the report.

Best,

punkish commented 4 years ago

Ok,

Don't forget that we have data to show how the production rate has changed over time since the office started, with the different variables involved, like granularity. I think we can show a clear tendency of ever-growing productivity since we started, and that might be an important argument in the report.

Yup, create a task under Activity 1 and write it up. Make sure to not exceed the 50w limit. If you find yourself going over that, ask yourself if you can divide it into two tasks.

mguidoti commented 4 years ago

Ok, I'll do my best!

Thanks

punkish commented 4 years ago

Ok, I'll do my best!

Awesome. Numbers showing productivity increases are very important. It would be lovely if you could create a couple of charts (see the image option in the template). This in itself could also be a standalone blog post on the Plazi website.

mguidoti commented 4 years ago

Yes, that's exactly what I have in mind. I'll definitely do my best. Time is a bit tight, as I spend most of my day micromanaging the office and answering questions, but I'll do my best.

myrmoteras commented 4 years ago
  1. Describe Synospecies as an application based on our data https://synospecies.plazi.org/
punkish commented 4 years ago
  1. Describe Synospecies as an application based on our data https://synospecies.plazi.org/

This looks like it belongs under Activity 3, so I have added it to the list above as 3.3. (see https://github.com/plazi/arcadia-project/issues/144#issuecomment-621471556).

Note to everyone: As stated in the linked post, please feel free to add new activities, but they must be at the 'x.y.' level. Please do not create a new top-level activity.

@myrmoteras Please assign someone to write up this task (probably Reto) as it won't write itself.

mguidoti commented 4 years ago

Hi,

Here's my input on what was asked of me. I hope you find it useful. I followed the 50-word limit, and in the productivity report I focused on what Donat personally told me to do. If any clarification is needed, please just ping me. The productivity report is a bit long, but I tried to keep it as summarized and as focused on the Arcadia report as possible (although I do have a lot to tell from that data).

Lycophron

49 words

Lycophron is a Python script developed to batch upload publications to Zenodo from a spreadsheet containing their metadata. After the last Arcadia Sprint Meeting, the Zenodo team asked Plazi to merge Lycophron with their Batch Uploader. This is an ongoing project that will be finished in the upcoming months.

Repo: https://github.com/plazi/lycophron
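A minimal sketch of the batch-upload idea behind Lycophron (not its actual code): read publication metadata from a spreadsheet and create one Zenodo deposition per row through the Zenodo REST deposit API. The CSV column names and file name are hypothetical.

```python
# Illustrative sketch only; see the Lycophron repo above for the real tool.
import csv
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "YOUR_ZENODO_TOKEN"  # personal access token with deposit scope

def create_deposition(row):
    """Create a Zenodo deposition from one spreadsheet row and return its id."""
    metadata = {
        "metadata": {
            "upload_type": "publication",
            "publication_type": "article",
            "title": row["title"],
            "creators": [{"name": n.strip()} for n in row["authors"].split(";")],
            "description": row.get("abstract") or row["title"],
        }
    }
    resp = requests.post(ZENODO_API, params={"access_token": TOKEN}, json=metadata)
    resp.raise_for_status()
    return resp.json()["id"]

with open("publications.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        print(row["title"], "->", create_deposition(row))
```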

O3RT

48 words

O3RT is an Open Refine reconciliation tool that queries Refindit, retrieving the most likely DOIs for publications based on their title. This was created to avoid DOI duplication when batch uploading publications to Zenodo. The user can then manually select or evaluate the choices made by the program.

Repo: https://github.com/plazi/O3RT
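O3RT itself reconciles against ReFindit; as an illustration of the same idea, matching a title against a bibliographic service to retrieve likely DOIs, here is a sketch that uses the public CrossRef REST API as a stand-in (not O3RT's actual code).

```python
# Sketch of title-to-DOI reconciliation using CrossRef as a stand-in service.
import requests

def candidate_dois(title, rows=5):
    """Return (DOI, matched title, relevance score) candidates for a title."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [
        (it.get("DOI"), (it.get("title") or [""])[0], it.get("score", 0.0))
        for it in items
    ]

for doi, matched, score in candidate_dois("A revision of the ant genus Myrmoteras"):
    print(f"{score:8.2f}  {doi}  {matched}")
```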

Learning Sources

47 words

Screencasts describing the process of treatment extraction from a single article were recorded, with English subtitles available (currently under proofreading). Screencasts explaining the template creation process and the quality check are planned. An 8-hour minicourse was developed, with 20h and 40h versions currently under development.

Currently Available Screencasts: https://www.youtube.com/playlist?list=PLFbvkmnvLdUdGmmn8SR4xyRRxulvVu7BE

mguidoti commented 4 years ago

Plazi@Poa Productivity Report

Last Update: 2020-05-09

TL;DR

Disclaimers

Raw data

As I explained during my presentation last February, we're using an app called Notion.so to keep track of the production rate. I then run a script I wrote, TBtoNotion, to download some data from the TB API to complete the database. These are the data fields I collect from the TB API:
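A purely illustrative sketch of that pipeline: pull records from a TB-style endpoint and append them to a local file that completes the Notion-based database. The endpoint URL and field names below are hypothetical placeholders, not the actual fields collected; the real script is TBtoNotion.

```python
# Hypothetical sketch of a TBtoNotion-style pull; endpoint and field names
# are placeholders, not the actual TreatmentBank API.
import csv
import requests

TB_API = "https://tb.example.org/api/stats"  # placeholder, not the real endpoint

def fetch_records(params):
    resp = requests.get(TB_API, params=params)
    resp.raise_for_status()
    return resp.json().get("data", [])

def append_to_csv(records, path="tb_export.csv"):
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for rec in records:
            writer.writerow([rec.get("docId"), rec.get("uploadDate")])

append_to_csv(fetch_records({"format": "json"}))
```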

Teams Time Management (Main focus)

| Person   | Nov/2019 | Dec/2019 | Jan/2020 | Feb/2020 | Mar/2020 | Apr/2020 | May/2020 |
|----------|----------|----------|----------|----------|----------|----------|----------|
| Tatiana  | training | training | templates | learning mat | Acta/Corona Viruses Project | Corona Viruses Project | Corona Viruses Project |
| Carolina | training | training | templates | NI | NI/RSZ | Zootaxa new | EJT backlog |
| Valdenar | training | learning material/training | learning material | learning material | Linzer backlog | EJT new/Linzer backlog | EJT new and backlog/Linzer backlog |
| Felipe   | - | - | training/templates | IM/ZS/templates | IM/ZS/templates | templates | ZK/templates |

On Templates

It dropped from ~10h/template to ~4h/template last month due to an express template creation workflow we established. The big difference is the reduction in the number of papers in the testing sample, and in the amount of time we initially spend investigating how suitable a journal is for templating. However, we still document the whole process, which culminates in a guide highlighting the biggest issues and challenges of working with a particular journal.

At this point, I don't see how we could improve this production rate. I think we hit a limit considering the quality/time spent trade-off.

Plus, GGI improved greatly thanks to our direct feedback to Guido and his work over the past few months. Sometimes we had to wait a little before we could go back and work on a given journal, which in my opinion was expected, as journals vary a lot in layout and no one before us has put this amount of time into templating. We compiled a list of the improvements caused/motivated by our input here.

Data Summarized

On Article Processing

Here I summarized data on dedicated hours, number of pages processed, and number of treatments produced. The data might not match TB's for several reasons. To name one, the latest ZooKeys articles uploaded by the office were downloaded by Guido to the server, which introduced a first-upload date different from the final date we use in Notion; also, we sometimes change this date for Concluded articles to account for additional edits we had to make (e.g., errors that we notice on GBIF).

Of these, I think the only production rate statistic that makes sense is pages per hour, because articles and treatments vary in number of pages, whereas a page is an entity of its own. Yes, pages can vary in complexity, but they are still the most comparable unit we have. Thus, this is the production rate I'm presenting here, although the rates based on treatments and articles are included in the data summary files below.

In April we adopted a new level of granularity, which we call aleluia-level. It's the lowest level of granularity the office has operated at since we started. Thus, the order of levels adopted so far, from highest to lowest, is:

  1. High-level
  2. Low-level
  3. Aleluia-level

Based on this, we improved from 17 pages/h in March to 47 pages/h in April (aleluia-level started) to 79 pages/h in May (only Aleluia-level). You can check how each of the levels contributed to the production rate in one of the data summaries below.
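Since charts of this progression were requested earlier in the thread, here is a small matplotlib sketch that plots the monthly figures quoted above (17, 47, and 79 pages per hour), meant only as a starting point for the real figure.

```python
# Plot the pages-per-hour progression cited above (Mar-May 2020).
import matplotlib.pyplot as plt

months = ["Mar 2020", "Apr 2020", "May 2020"]
pages_per_hour = [17, 47, 79]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, pages_per_hour)
ax.set_ylabel("pages processed per hour")
ax.set_title("Plazi@Poa processing rate by month")
fig.tight_layout()
fig.savefig("pages_per_hour.png", dpi=150)
```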

One point I still have to make is that granularity is not the only thing that impacts the production rate. These factors are important as well:

This means that low-level granularity on an inconsistent journal can take more time than high-level granularity on a 'good' journal, like EJT (which is, by the way, the only journal processed at high level since March).

Granularity levels, changes in focus, and how we operate were extensively discussed with Donat, and we followed every decision made in those conference calls, as well as last-minute requests and changes.

I also think we have hit the best trade-off between data quality and production rate, and we won't be able to improve much further. Maybe with time and experience we will get a bit faster. We guarantee that there are no missing or broken treatments, especially for new species, at the current workflow and production rate. The same can't be said for Zootaxa, for instance.

Data Summarized

myrmoteras commented 4 years ago

@mguidoti thanks for the report and all the work by you and your team.

Let's discuss it overall once we have submitted the report.

In Arcadia, the reporting unit is number of treatments, number of new species liberated, and number of templates. For this reason, you can't change the reporting system to number of pages only, and I would like to ask for these numbers. I made an effort here.

What did you do in regard to teaching? How many screencasts and other teaching materials do we have?

It would be helpful if the list of treatments were accessible on the Plazi Communications GitHub, and with the proper name of the journal, so as to link it with the production statistics in the Plazi stats, e.g., Journal of Paleontology.

How do you define a finished template? Shouldn't it be that it can successfully process a batch of XX articles to the aleluia level, and then we could use those stats to show that it works?

Based on the 79 pages per hour, that translates into 39 treatments per hour (700K pages, 38K treatments). To process 250K treatments in year 3, we need 6,410 hours. Using 1,890 working hours per year, this would need 3.4 persons dedicated to this work.

Regarding graphics: Yes, we can't include more than one graphic. But for internal communications that is helpful nevertheless and should be considered.

Please also consider what production data we can move to https://github.com/plazi/Plazi-Communications/wiki/Statistics so we have one central place with the numbers, maybe including figures.

myrmoteras commented 4 years ago

@mguidoti Regarding Lycophron: we already mentioned this last year in the report. Where do we stand with the import of taxodros that we planned to upload in y2?

mguidoti commented 4 years ago

Hi Donat,

In Arcadia, the reporting unit is number of treatments, number of new species liberated, and number of templates. For this reason, you can't change the reporting system to number of pages only, and I would like to ask for these numbers. I made an effort here.

On templates

The number of templates was given already, by Skype. It's 24 and here's the list. This was also added here, as you asked.

On treatments

As for the number of treatments and species liberated, you said yourself that you would get these from TB, as TB has data coming from other sources too (Pensoft, you, Jeremy). What I'm offering here is a way of predicting what we will accomplish in the upcoming months (say, until August).

The number of treatments doesn't work for these estimates. One treatment can be a single line (catalogs, lists of distributional records) or many pages (a full description). The time we spend therefore varies by orders of magnitude. The number of new species also makes no sense for actual estimates, as we can't predict the number of new species in a given PDF. And we don't manually track the number of new species; we rely on TB. This is my opinion. But the treatments/h rate, even if it doesn't make much sense in my humble opinion, was included in the shared data summary files from my previous post. They are included here too, to make things easier.

Small note: we went from 11 treat/h to 27/h when we started the aleluia-level of granularity.

Plus, I'm not changing the reporting system; I'm offering an educated opinion based on actual production data after months of this operation. This is the first time we have this. When you wrote the project, this wasn't available. Data can't be completely ignored, especially when it tells a different story; that's what the scientist in me tells me to say, and that's why I'm highlighting what I consider the right things. As said previously, the numbers you need and want for the report you said you would get from TB, which makes sense.

Finally, always keep in mind that there are some differences between the absolute number of treatments in our data summaries and the ones you get from TB. This is normal and expected for a host of reasons, some cited in my original post here.

Let me know if this helps. If not, tell me what's missing again and I'll do my best.

mguidoti commented 4 years ago

What did you do in regard to teaching? How many screencasts and other teaching materials do we have?

  1. We did 42 screencasts, covering individual processing of articles only at this point. On this link we can see the number of videos too.

We also produced a list of commented steps for both individual processing and template making. This is intended to be part of the updated version of the manual, a project halted for now due to all the other priorities we have, both individually and as an office. This, again, was not a decision taken by us, but it was discussed.

Course production has also halted. As you can see, we changed Valdenar's and Tatiana's focus to other things in Feb/2020. We have a draft outline for the 40h course, by Tatiana, awaiting my review and ultimately yours.

I didn't include all of these details in the provided paragraph in an attempt to follow the 50-word limit Puneet asked me to.

mguidoti commented 4 years ago

It would be helpful if the list of treatments were accessible on the Plazi Communications GitHub, and with the proper name of the journal, so as to link it with the production statistics in the Plazi stats, e.g., Journal of Paleontology.

Ok, I can definitely add a list of all template-based processing done by the office there. I have just added it to my task list.

On journal name problems

We manually fix the journal names (and change the value on the template) to normalize this as soon as we notice an error.

But it was agreed that we would run something on the server to catch and perhaps fix this automatically in the future. It was said that we would worry about it after the project regarding author names, as we would learn a lot from it and could apply that experience to this particular problem.

mguidoti commented 4 years ago

How do you define a finished template? Shouldn't it be that it can successfully process a batch of XX articles to the aleluia level, and then we could use those stats to show that it works?

From our experience, we can't simply blindly trust a batch process. Looking at Zootaxa papers done this way, you can see that even when the DwC file is produced, not one but many treatments can be missed altogether inside that paper, including treatments of new species. This happens for several reasons, none of which can be predicted and handled by Guido; journals simply vary too much, even from paper to paper. Templates can't anticipate all of these variations. A template is just a blueprint.

The right way to process articles, guaranteeing a minimum level of data quality, is to open each and every IMF, correct errors, including the ones from QC (blockers plus one critical regarding treatments), and only then upload. On automation: we could batch process on the server. This would save some time, but I would never get rid of a human check before uploading the data, especially once we start issuing DOIs for each and every treatment on the fly.

Again, that's what our experience from the front lines teaches us.

mguidoti commented 4 years ago

Based on the 79 pages per hour, that translates into 39 treatments per hour (700K pages, 38K treatments). To process 250K treatments in year 3, we need 6,410 hours. Using 1,890 working hours per year, this would need 3.4 persons dedicated to this work.

That's exactly why the treatments/h rate doesn't work. You overestimated according to our data. Our data show a production rate of 28 treat/h, not 39. You can't derive it from pages per treatment; that doesn't mean much when treatments can be one paragraph or several pages long.

Following your reasoning and the actual data (28 treat/h), we need 8,928 hours. Considering 1,890 working hours per year, we would need 4.7 people. Right now we have 4 people in the office, but one is allocated to the Corona Viruses Project and another is basically doing templates only. The time needed to produce enough templates to have enough papers to process also has to be considered in your estimates. This means we have only two people exclusively focused on extracting treatments, and that will last only until the next last-minute request.

We are short on human power when we look at the real data, not pages/treat.
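To make the two scenarios easy to compare, here is the arithmetic from this exchange written out as a small script: the 39 treatments/h figure above versus the measured 28 treatments/h, a year-3 target of 250K treatments, and 1,890 working hours per person per year.

```python
# Staffing estimate: hours and people needed at a given treatments-per-hour rate.
TARGET_TREATMENTS = 250_000
HOURS_PER_PERSON_YEAR = 1_890

def staff_needed(treatments_per_hour):
    hours = TARGET_TREATMENTS / treatments_per_hour
    return hours, hours / HOURS_PER_PERSON_YEAR

for rate in (39, 28):
    hours, people = staff_needed(rate)
    print(f"{rate} treatments/h -> {hours:,.0f} hours -> {people:.1f} people")
# 39 treatments/h -> 6,410 hours -> 3.4 people
# 28 treatments/h -> 8,929 hours -> 4.7 people
```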

Another thing that has to be considered is the 'collector curve effect' we will experience. Let me explain: we're targeting journals that we know are treatment-rich and describe many new species. But there is a limited number of such journals. Soon enough we will mostly be dealing with journals that aren't that treatment-rich, which means the time spent processing a paper and creating templates will be less effective in terms of treatment production. In other words, there is a drop in the treatment production rate on the horizon that should also be factored, somehow, into these estimates.

That's why I'm pushing forward our statistics. They came from real experience in the processing facility.

mguidoti commented 4 years ago

Regarding graphics: Yes, we can't include more than one graphic. But for internal communications that is helpful nevertheless and should be considered.

Please also consider what production data we can move to https://github.com/plazi/Plazi-Communications/wiki/Statistics so we have one central place with the numbers, maybe including figures.

Ok, will do.

myrmoteras commented 4 years ago

Hi Donat,

In Arcadia, the reporting unit is number of treatments, number of new species liberated, and number of templates. For this reason, you can't change the reporting system to number of pages only, and I would like to ask for these numbers. I made an effort here.

On templates

The number of templates was given already, by Skype. It's 24 and here's the list. This was also added here, as you asked.

That means the figure of 47 is the number of templates, for 24 different journals?

mguidoti commented 4 years ago

Hi Donat,

In Arcadia, the reporting unit is number of treatments, number of new species liberated, and number of templates. For this reason, you can't change the reporting system to number of pages only, and I would like to ask for these numbers. I made an effort here.

On templates

The number of templates was given already, by Skype. It's 24 and here's the list. This was also added here, as you asked.

That means the figure of 47 is the number of templates, for 24 different journals?

Yes, and some are not Status=Completed. This spreadsheet is the raw data.

Here you can see three entries for the same journal, for different purposes. We do it like this to track the time for each template we produce, regardless of the journal.

image

In sum, we cover 24 different journals.

mguidoti commented 4 years ago

@mguidoti Regarding Lycophron: we already mentioned this last year in the report. Where do we stand with the import of taxodros that we planned to upload in y2?

You haven't mentioned the fact that Lycophron is no longer a Plazi-only tool; it will be the official Zenodo uploader, developed by the two teams combined. It shows acceptance of the idea by a CERN initiative. I think that's relevant. That's what I wrote in my 50-word paragraph.

Taxodros is different from Lycophron, especially now that Lycophron has outgrown the project. The taxodros project, however, will use Lycophron.

For taxodros, I have to:

Why I haven't done this yet: office management, the Corona project, CBZ, Fabricius types, DOI registration, organizational tasks, BLR, etc.

Basically, everything that pops up is urgent, and I have to stop what I'm doing and work on it. I had planned and scheduled with the Zenodo team to work on Lycophron this month, after the report (in my mind). But it seems that you will want to change the office workflow again, so I'm assuming we will put a lot of hours into discussing and passing the new instructions to the team, reorganizing the workflow in the current tools, and so on. I'm looking forward to discussing this in light of the data from the processing facility.

I hope this answers. If you need anything else, please, just ping me.

Cheers,

myrmoteras commented 4 years ago

@gsautter where are the repositories for the code of the quality control and the template tools? We need to provide these links.

gsautter commented 4 years ago

QC: https://github.com/gsautter/goldengate-qualitycontrol/ (with base classes in https://github.com/gsautter/idaho-core and https://github.com/gsautter/idaho-imagemarkup , the document listing for the QC Tool in https://github.com/gsautter/goldengate-server-imagemarkup/ , and the error stats currently a custom implementation due to a few dependency issues I want to keep out of our main repos)

Template Tools: https://github.com/gsautter/goldengate-imagine-plugins/ (with base classes being spread out between https://github.com/gsautter/idaho-extensions and https://github.com/gsautter/idaho-imagemarkup )