openplantpathology / Reproducibility_in_Plant_Pathology

A systematic, quantitative review of articles that provides a basis for identifying what has been done so far on reproducibility in plant pathology research, along with suggestions for ways to improve it.
https://openplantpathology.github.io/Reproducibility_in_Plant_Pathology

Decide on attributes of papers to record #5

Closed zachary-foster closed 7 years ago

zachary-foster commented 7 years ago

Here is a start based on our previous discussions, but we should put this in an Rmd.

Paper attributes

Journal attributes

emdelponte commented 7 years ago

@zachary-foster and @adamhsparks

I suggest that we focus only on a selected set of journals (our expert judgment or asking others to review would suffice), which are the primary choice for most plant pathologists - recall that we will be submitting this potential manuscript to the leading plant pathology journal. I made a quick list below based on Adam's previous selection.

Given that most are applied, I suggest a different categorization as well. If you agree, let's check if these categories are correct and each of us could pick a set of around seven journals to scrutinize.

Based on my experience, focusing on raw data accessibility and computational methods, most articles will fall into a "not reproducible" category, and if that is true the work will be quite quick, so we could increase the number of articles per journal to, say, 20, hence 400 articles! Let's randomly select 100 articles per journal (Adam's code) and then decide later where to stop recording, in order to be consistent.

| Journal name | Scope | Research aspect |
| --- | --- | --- |
| Australasian Plant Pathology | Broad | Applied |
| Canadian Journal of Plant Pathology | Broad | Applied |
| Crop Protection | Broad | Applied |
| European Journal of Plant Pathology | Broad | Applied |
| Forest Pathology | Specialized | Applied |
| Journal of General Plant Pathology | Broad | Applied |
| Journal of Phytopathology | Broad | Applied |
| Journal of Plant Pathology | Broad | Applied |
| Journal of Plant Virology | Specialized | Applied |
| Molecular Plant Pathology | Broad | Fundamental |
| Nematology | Specialized | Fundamental/Applied |
| Physiological and Molecular Plant Pathology | Broad | Molecular |
| Phytoparasitica | Broad | Applied |
| Phytopathologia Mediterranea | Broad | Applied |
| Phytopathology | Broad | Fundamental/Applied |
| Plant Disease | Broad | Applied |
| Plant Health Progress | Broad | Applied |
| Plant Pathology | Broad | Fundamental/Applied |
| PLoSONE | Broad | Fundamental/Applied |
| Revista Mexicana de Fitopatología | Broad | Applied |
| Tropical Plant Pathology | Broad | Applied |
zachary-foster commented 7 years ago

@emdelponte

I suggest that we focus only on a selected set of journals

I'm fine with that, as long as the articles are selected randomly.

Given that most are applied, I suggest a different categorization as well.

Yea. Perhaps we should be categorizing individual articles as e.g. "applied" vs "molecular", instead of the journal. Many journals would accept both types of articles (e.g. PlosONE).

each of us could pick a set of around seven journals to scrutinize

Do you mean scrutinize the journals in order to determine journal attributes or that we each get seven journals to read papers from? If the former, I agree. If the latter, I think we should randomly pick who reads what article independent of the journal so that the reader is not a confounding factor with journal.

Let's randomly select 100 articles per journal (Adam's code) and then decide later where to stop recording, in order to be consistent.

I like that plan. We should make "goals" that we all have to meet before anyone reads more papers. That way no one does more work than needed. For example, we can start with 20 articles each and once everyone has read 20 articles we increase the goal to 30 and so on until we get tired of it. An issue for each goal would work well to keep track of progress.

emdelponte commented 7 years ago

@zachary-foster @adamhsparks

Yea. Perhaps we should be categorizing individual articles as e.g. "applied" vs "molecular", instead of the journal. Many journals would accept both types of articles (e.g. PlosONE).

I like this simplification and categorization at the article rather than the journal level - both types will definitely be found in the same journal, and both levels can be used. We will get a better sense of the article-level categories as the work progresses; these could include pathogen description, population biology, epidemiology, management, etc., so that we could identify which kinds of studies authors are more likely to make reproducible. Anyway, the simpler the better, but let's see how it goes!

Do you mean scrutinize the journals in order to determine journal attributes or that we each get seven journals to read papers from? If the former, I agree. If the latter, I think we should randomly pick who reads what article independent of the journal so that the reader is not a confounding factor with journal.

Yes, I meant the scrutiny only after randomly selecting them. For journals like PLoS and Crop Protection we also need to decide whether we will skip articles that are not plant-pathology related, which will be the most common case. The same goes for specialized journals such as Nematology when no plant pathogen/disease is involved.

adamhsparks commented 7 years ago

Yea. Perhaps we should be categorizing individual articles as e.g. "applied" vs "molecular", instead of the journal. Many journals would accept both types of articles (e.g. PlosONE).

This was my original intent.

adamhsparks commented 7 years ago

@zachary-foster @emdelponte @grunwald I'm finally getting back around to this and will dedicate some time this week to this work. From what I read here, you guys have captured my original ideas and clarified them much better than I had managed to.

My take on this is that we need to decide on the journals we're sampling from; the list @emdelponte gave above is a good start. If we drop one from the list, we have twenty, which leaves each of us with five journals that we can select our articles at random from (four replicates of five journals each, if you will).

Then each of us can use my code to randomly select the articles from our respective set of five journals.
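As a rough sketch of what that selection could look like (this is not the actual script; the `articles` data frame, with one row per candidate article and `journal` and `doi` columns, is a placeholder):

library(dplyr)

set.seed(2017)  # fix the seed so the article selection is itself reproducible

# placeholder input: one row per candidate article, with `journal` and `doi` columns
selected <- articles %>%
  group_by(journal) %>%
  sample_n(size = 100) %>%  # 100 candidate articles per journal, as discussed above
  ungroup()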

I'm happy to leave it to each of us to use our own discretion in how to categorise the articles. I suspect we'll have some that fall into two categories, or maybe even three.

@zachary-foster's paper attributes look good to me. I do agree with @emdelponte that if we define reproducible as having raw data and computer code available, etc., almost everything will fail. But we can make that the gold standard and see if we can find anything that reaches that level. This means that we should have some basic categories for reproducibility as well.

emdelponte commented 7 years ago

@zachary-foster @adamhsparks @grunwald

I like these categories. How do you envision the data frame? One article per row, but what about the columns and the values to assign? Binary or ordered scores (e.g. for computational methods: 0 - no script; 1 - login needed; 2 - publicly accessible)? Could the final category be decided based on the median score?
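To make that idea concrete, a quick sketch with made-up scores and placeholder cut-offs (the thresholds and labels are only illustrative):

scores <- c(comp_methods = 2, software = 1, data = 0)  # example ordinal scores for one article

# derive a category from the median of the per-criterion scores
category <- cut(median(scores),
                breaks = c(-Inf, 0.5, 1.5, 2.5, Inf),
                labels = c("not reproducible", "bronze", "silver", "gold"))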

I vote for removing PLoS from the list - it is the only one not explicitly related to plant pathology.

For Crop Protection, one should skip articles that do not deal with plant disease.

adamhsparks commented 7 years ago

@emdelponte, I think that's a reasonable idea; we can assign a value to the categories as you've suggested. I've updated my previous comment with scores on a 0-3 scale. A gold score would be 6 in this case and silver would be 4, but there might be 5s depending on data vs. computational methods, etc.

I second removing PLoS from the list and agree with skipping non-plant pathology articles in Crop Protection.

Here's how I'd envision the data frame structure.

reproducibility <- tibble::tibble(
  Article = "The Area Under the Disease Progress Stairs: Calculation, Advantage, and Application",
  DOI = "PHYTO-07-11-0216",
  Journal = "Phytopathology",
  Authors =  "Ivan Simko and Hans-Peter Piepho",
  Year = 2012,
  Vol = 102,
  Iss = 4,
  pp = "381-389",
  IF = 3.011,
  Journal_class = "Fundamental",
  Page_charges = 130,
  Country =  "USA",
  Open_or_Restricted = "Optional",
  Reproducibility_instructions = FALSE,
  Iss_per_Year = 12,
  Supl_mats = TRUE,
  Comp_methods_availability = 2,
  Software_availability = 1,
  Software_citation = 3,
  Analysis_automation = 0,
  Data_availability = 0,
  Data_annotation = 0,
  Data_tidiness = 0
)

reproducibility <- dplyr::mutate(reproducibility,
                                 Reproducibility_score = sum(Comp_methods_availability,
                                                             Software_availability,
                                                             Software_citation,
                                                             Analysis_automation,
                                                             Data_availability,
                                                             Data_annotation,
                                                             Data_tidiness))
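One caveat with this sketch, assuming the same column names: sum() inside mutate() collapses over all rows, so it only gives a per-article total while the tibble holds a single article. Once there is one row per article, a vectorised per-row sum would be needed, for example:

reproducibility <- dplyr::mutate(reproducibility,
                                 Reproducibility_score = Comp_methods_availability +
                                   Software_availability + Software_citation +
                                   Analysis_automation + Data_availability +
                                   Data_annotation + Data_tidiness)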
zachary-foster commented 7 years ago

I have adapted the paper attributes to the four-level (0-3) scoring system and put them in an Rmd so we can refine them further. Check it out here:

https://github.com/adamhsparks/Reproducible-Research-in-Plant-Pathology/blob/master/reproducibility_criteria.Rmd

@adamhsparks

If we drop one from the list, we have twenty, which leaves each of us with five journals that we can select our articles at random from (four replicates of five journals each, if you will).

I think we should select the articles from all the journals first and then split them up randomly, independent of journal; otherwise the person reading will be a confounding factor with journal.
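Something like this minimal sketch would do the split (the `selected` data frame and the reader handles are placeholders):

set.seed(2017)
readers <- c("adamhsparks", "emdelponte", "grunwald", "zachary-foster")

# deal the pooled articles out to readers at random, ignoring which journal they came from
selected$reader <- sample(rep(readers, length.out = nrow(selected)))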

if we define reproducible as having raw data and computer code available, etc., almost everything will fail. But we can make that the gold standard and see if we can find anything that reaches that level.

Yes, I would expect very few projects to be entirely reproducible. However, with the 0-3 scoring system, 2 (silver) would still be relatively good.

@emdelponte

binary or ordered scores (e.g. for computational methods: 0 - no script; 1 - login needed; 2 - publicly accessible).

I like the idea of a 0-3 scoring with 3 being exceptional and 1-2 being typical.

I vote for removing PLoS from the list - it is the only one not explicitly related to plant pathology. For Crop Protection, one should skip articles that do not deal with plant disease.

Agreed.

adamhsparks commented 7 years ago

I think we should select the articles from all the journals first and then split them up randomly, independent of journal; otherwise the person reading will be a confounding factor with journal.

Agreed

I've edited the Rmd file to include the example tibble with @zachary-foster's updated suggestions for reproducibility categories:

https://github.com/adamhsparks/Reproducible-Research-in-Plant-Pathology/blob/master/reproducibility_criteria.Rmd

Where do we now categorise SAS? There is a free University edition for download or use with AWS Cloud. Having looked at it, I think it might now fall into a 2 rating. You have to sign up, login, etc., so it's free but it's still proprietary.

grunwald commented 7 years ago

While SAS is finally free, one cannot reproduce publication-ready graphs in SAS, so SAS should get a lower score. Also, the code is not open source and cannot be improved by the SAS user community.

grunwald commented 7 years ago

I vote for removing PLoS from the list - it is the only one not explicitly related to plant pathology. For Crop Protection, one should skip articles that do not deal with plant disease.

I would include PLOS indirectly by randomly selecting only its 'plant pathology' articles.

grunwald commented 7 years ago

binary or ordered scores (e.g. for computational methods: 0 - no script; 1 - login needed; 2 - publicly accessible). I like the idea of a 0-3 scoring with 3 being exceptional and 1-2 being typical.

I like the 0 (no code) to 3 (fully open-source code) scale.

zachary-foster commented 7 years ago

Where do we now categorise SAS? There is a free University edition for download or use with AWS Cloud. Having looked at it, I think it might now fall into a 2 rating. You have to sign up, login, etc., so it's free but it's still proprietary.

I think a score of 2 sounds about right. It's easily available, but proprietary. Not being open source makes it less reproducible even if it is free (e.g. you don't know how a change in version would affect results, or the details of how algorithms are implemented).

grunwald commented 7 years ago

Where do we now categorise SAS? There is a free University edition for download or use with AWS Cloud. Having looked at it, I think it might now fall into a 2 rating. You have to sign up, login, etc., so it's free but it's still proprietary.

Also keep in mind that SAS code is not open, and it cannot produce graphs for publication. Free is not the same as open.

adamhsparks commented 7 years ago

Yes, free != open; that's why I raised the issue.

@grunwald is right; I'd forgotten that you can't make graphs for publication. That is part of being reproducible. Maybe that does warrant a lower ranking? I don't want to be seen as saying SAS is bad to use, but...

adamhsparks commented 7 years ago

It seems that we are happy with this scale, so let's move forward using it and see how we go. I don't have any other journal attribute suggestions to make either, so we can record those as well.

adamhsparks commented 7 years ago

As I'm going through the list that @emdelponte proposed, I see that MPMI isn't listed here.

I think I need to rerun our list and add MPMI; that's a pretty big journal to omit.

adamhsparks commented 7 years ago

@emdelponte, I am not finding a "Journal of Plant Virology" as you have suggested.

I've elected to go with http://www.virologyj.com/sections/plant for these articles.

emdelponte commented 7 years ago

@adamhsparks Oops! It seems you found the right name! There is also Archives of Virology, but we should be OK with one representative of the field.

I agree with including MPMI.

adamhsparks commented 7 years ago

I'm almost done with the list. I'll finish up this evening and make a commit with our assigned articles.

adamhsparks commented 7 years ago

Closing this to clean up issues.

See: https://github.com/phytopathology/Reproducible.Plant.Pathology/blob/master/vignettes/reproducibility_criteria.Rmd for reproducibility criteria

See: https://github.com/phytopathology/Reproducible.Plant.Pathology/blob/master/vignettes/Assigning_Articles.Rmd for article assignments for each of us