plazi / arcadia-project

Use the treatment dictionary at Zenodeo as crystallization point for the treatment paper #146

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

@punkish had a great idea that might help us get the treatment paper done. Let's use the Zenodeo data dictionary, add what taxpub has, add what we have in GG XML, in @mguidoti's manual and references, notes about treatments at Zenodo, and all the other bits and pieces of treatment paper stubs.

Let's make this the declared goal for September 1, the end of the 2nd Arcadia grant operational year.

punkish commented 4 years ago

Well, let's modulate that a bit.

In the recent cov3 update, I created a new data dictionary. The whole thing starts from this script, but the exact start of the dd is on line 302. The main motivation was to move from creating all the logic in code to keeping the logic in a description (even though dd.js is code, I think of it more as a description because there are no actions in that file, only statements). That file actually pulls in the data dictionaries for the individual resources, for example, treatments. I have tried to summarize my logic in this short write-up about the importance of a data dictionary to what I do, aka the importance of configuration over code. This central place, where the description of all data starts, in turn powers all subsequent logic. Want a schema to validate incoming queries? dd2schema.js – a program powered by the data-dictionary. Want to create the SQL queries that query the database? dd2queries.js – another program powered by the data-dictionary. In fact, want to create a data-dictionary that combines all the disparate data descriptions and creates something that is more like a program? You guessed it, dd2datadictionary.js – a program powered by the data-dictionary.
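
To make the configuration-over-code idea concrete, here is a minimal sketch, not the actual Zenodeo code. The field names, the `queryable` flag, and the helper functions `ddToSchema` and `ddToQuery` are all assumptions made up for illustration; the real dd.js, dd2schema.js and dd2queries.js may look quite different.

```javascript
// Hypothetical, heavily simplified data dictionary: statements only, no actions.
// The field names are illustrative and are NOT taken from Zenodeo's dd.js.
const treatmentsDD = [
  { name: "treatmentTitle", type: "string", queryable: true },
  { name: "journalYear", type: "integer", queryable: true },
  { name: "fulltext", type: "string", queryable: false },
];

// Derive a JSON-Schema-like object for validating incoming query parameters.
function ddToSchema(dd) {
  const properties = {};
  for (const field of dd) {
    if (field.queryable) properties[field.name] = { type: field.type };
  }
  return { type: "object", properties, additionalProperties: false };
}

// Derive a parameterized SELECT statement from the same description.
function ddToQuery(dd, table, params) {
  const queryable = new Set(dd.filter((f) => f.queryable).map((f) => f.name));
  const columns = dd.map((f) => f.name).join(", ");
  const keys = Object.keys(params);
  for (const key of keys) {
    if (!queryable.has(key)) throw new Error(`unknown parameter: ${key}`);
  }
  // Named placeholders (@key) keep user values out of the SQL text.
  const where = keys.map((k) => `${k} = @${k}`).join(" AND ");
  return `SELECT ${columns} FROM ${table}` + (where ? ` WHERE ${where}` : "");
}

const schema = ddToSchema(treatmentsDD);
const sql = ddToQuery(treatmentsDD, "treatments", { journalYear: 2020 });
console.log(Object.keys(schema.properties));
console.log(sql);
```

Both outputs come from the same array of statements, so adding a field to the dictionary changes the validation schema and the generated SQL together, without touching either function.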

Back to this issue – what I suggested was that we should have a publicly available, scrutinizable, reviewable, versioned, and hence, frozen in time, treatments data dictionary. Not for my use, but to serve as the very underpinning concept of all the work we do. Like a regular, real-world language dictionary, this dictionary too can and will change over time, but it will be versioned, so we will know which one we are using in every instance. This dictionary should also be a part of a paper that describes our contribution to the data-science part of taxonomy. In many ways, we are doing (perhaps we have invented) computational taxonomy, or we are getting there. Treatments are our most significant contribution to this process, and have now been adopted as a first-class data citizen by Zenodo and GBIF. We need to publicize this, and the best way to do that is via a paper that is founded on citable, versioned data principles – a data-dictionary.

mguidoti commented 4 years ago

I suppose you guys are discussing the more technical treatment paper, not the one addressing taxonomists... right?

punkish commented 4 years ago

> I suppose you guys are discussing the more technical treatment paper, not the one addressing taxonomists... right?

No, just the opposite. We need a dd addressing taxonomists, the fundamental description and rules of what the heck this thing called a treatment is. See this project also. This will be the scientific paper that describes a scientific concept that is then used by programmers (among others) to create programs that automate different aspects of this taxonomic concept.

As such, this dd may not look anything like my dd.js, but it will be similar in spirit, just a bit more meta (and readable by scientists). It will, however, enable the creation of something like my dd.js and any other subsequent programs, as applicable, and enable this in an orderly way rather than an ad hoc way.

mguidoti commented 4 years ago

So, Puneet, Donat and I have been discussing this paper for a long time as well, and we have come to the realization that a paper technically describing the concept and its implementation is not the one that will 'sell' the idea to the end users, who are, by and large, not very tech-literate.

To target the most important people in this equation, we should have one separate paper highlighting the potential uses of the concept, with real-life cases (e.g. the Torsten paper, some of our side projects) and barely touching any technical aspect, but referring those interested in digging deeper to the technical paper.

As a taxonomist, I'm absolutely sure that you can't hope to pass this important message to this community by speaking a different language than your audience. Both angles are extremely important, but with different audiences in mind. Hence, the idea of two papers.

punkish commented 4 years ago

Yes, it is a very old topic, some of which I remember from the very beginning of my association with Plazi (years ago). I agree that this paper will not be technical. I think I have said very clearly that this paper will be scientific, not technical. Which is why I started my first post above with "Well, let's modulate that a bit." (Maybe I was being understated.)

But a scientific paper can describe how to recognize something, what it is, where it fits in the hierarchy. After all, that is what taxonomists do. We need to describe the treatment in scientific terms. That is the only paper I am talking about. The data-dictionary in this paper will not be computational, but it will lay the basis for creating something computational.

That said, yes, a technical data-dictionary is also needed. That may (or may not) have a supporting paper.

mguidoti commented 4 years ago

Still, regardless of how you label it, in my opinion, as I also said very clearly, this is still going to be too technical for the community.

Taxonomists will only 'buy' the idea if you show clear real-life applications of it. Dictionaries, how the idea was born, etc., will not interest most people. We have to show how this adds value to their research, regardless of the time they will have to put into it, which should not be ignored (it's a lot). We have to show how they can use treatments to produce more papers for their own careers and interests. That's the only way to grab the desired attention of the end users. People will not buy it for the elegance of the concept, or for how much sense it makes in theory, but for its real applications and the time/reward trade-off they will have.

The paper you're talking about is indeed important and should come out first, but it's not what we need to build the bridge to those who matter, in my opinion. Again, regardless of how you label it, this is what at least Donat and I have been calling, to each other, the 'technical' paper.

punkish commented 4 years ago

Well, @mguidoti, I don't want to get mired in arguments about classifying the paper. If you think it is a "technical" paper, well then, it is a technical paper. The fact is, this is something that has been discussed for years and never done for a variety of reasons, mostly other pressing issues. We should definitely not let semantics become another block.

mguidoti commented 4 years ago

It wasn't my goal, at all... I'm just saying: 'great, but please, let's not forget that we have to speak the end user's language at some point too.'

Best,

punkish commented 4 years ago

> I'm just saying: 'great, but please, let's not forget that we have to speak the end user's language at some point too.'

Nowhere in this issue has there been any suggestion otherwise, at least not intentionally.

Here is a conceptual overview of this issue:

┌──────────────────────┐                                                 
│                      │                                                 
│  this paper          │                                                 
│                      │                                                 
│                      │                                                 
│    ┌────────────┐    │                                                 
│    │ scientific │    │                                                 
│    │  concept   │    │                                                 
│    └────────────┘    │                                                 
│           │          │                                                 
│           │          │                                                 
│           ▼          │                                                 
│    ┌────────────┐    │                                                 
│    │  logical   │    │                                                 
│    │description │─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─         
│    └────────────┘    │                                        │        
│           │          │                                                 
└───────────┼──────────┘                                        │        
            │                                                            
            │                                                   │        
            │                                                            
 ┌──────────┼───────────┐                                       ▼        
 │          ▼           │                               ┌ ─ ─ ─ ─ ─ ─ ─ ┐
 │   ┌────────────┐     │                                     other      
 │   │programmatic│     │                               │ programmatic  │
 │   │ dictionary │─────┼──────────────┐                   dictionary    
 │   └────────────┘     │              │                └ ─ ─ ─ ─ ─ ─ ─ ┘
 │          │           │              │                                 
 │          ▼           │              │                                 
 │   ┌────────────┐     │   ┌──────────┼───────────┐                     
 │   │  programs  │     │   │          ▼           │                     
 │   │            │     │   │   ┌────────────┐     │                     
 │   └────────────┘     │   │   │ derivative │     │                     
 │                      │   │   │ dictionary │     │                     
 │     GGI family       │   │   └────────────┘     │                     
 │                      │   │          │           │                     
 └──────────────────────┘   │          │           │                     
                            │          ▼           │                     
                            │   ┌────────────┐     │                     
                            │   │  programs  │     │                     
                            │   │            │     │                     
                            │   └────────────┘     │                     
                            │                      │                     
                            │       zenodeo        │                     
                            │                      │                     
                            └──────────────────────┘                     

Right now, there is no written concept and no first-level dictionary, as far as I know. So, when I started developing the Zenodeo dictionary, I had to depend upon the oral input from @myrmoteras @tcatapano @gsautter and @mguidoti, and then most of us had a chance to iterate on that. Since Zenodeo is dependent on @gsautter's work, it is not going to be directly affected by the contents of the proposed paper. It will always be dependent on the data that @gsautter extracts. So, that would be the first-level programmatic dictionary.

But, if such a paper were to exist, then we could theoretically see someone else read it, create their own first-level programmatic dictionary, and develop their own applications to extract treatments from papers, or from some other form of content. We can't see the future. But we can lay the foundation for letting many flowers bloom, not just one flower (and not a thousand flowers either, as that would be too optimistic).

Anyway, let's write the darn thing and then we can decide who it is for and whether the paper is submitted to the journal of insects or to the journal of bits and bytes.

Please see this project for keeping tabs on the progress and this repo for the paper itself.

myrmoteras commented 4 years ago

There is not ONE paper.

Making treatments popular is heavy lifting. We need to explain the concept, we need to explain the technical aspects, define the terms. We need examples, such as what Jeremey is writing up on TRex.

What Puneet is suggesting is the technical description, including the vocabulary, the metadata we use in BLR, maybe taxpub, maybe special issues in GGXML. This is also relevant to popularize this new subtype of the DataCite type (Terry has the proper language here).

For the more scientific community, there are several notes floating around.

Let's make this a goal for May to September.

punkish commented 4 years ago

OK, now that I understand the concerns of @myrmoteras and @mguidoti, let there be two papers, or more papers. Yes, I am suggesting that we have a first-level data dictionary, but I am also suggesting that we have a conceptual description of a treatment codified in a peer-reviewed paper. So there are two… let's get them done.

I feel we have struggled to get even one paper out, so we should not get greedy and already start thinking of two or more papers, but whatever. Let's get one out. Let's start writing, and then it will become clear what it turns into. Maybe we can branch it off as needed into another paper.