Dear @open-sci/the-leftovers-2-0,

Please find here attached my comments of all your material. You have to address all of them and, once finalised, to close this issue with a comment containing your reply to each of the points I have highlighted. There is no specific deadline to complete this task, thus please take your time.

Please, be aware that some modifications in some document may affect also modifications in other documents. As a final note, please remember to keep your notebooks up-to-date.

After closing this issue, please remember to update your material.md file by specifying the references to the new version of all your documents.

As usual, for further doubts, do not hesitate to contact me in the Signal group or just comment this issue here.

General comments from the presentation

The file names for input/output of the process are are embedded in the code. Could you change it so as to enable anyone to specify as input its favourite paths when running the scripts - e.g. by specifying some input parameter?
It would be important to keep track of historical data (a.k.a. provenance information), in order to check when a certain DOI become valid or invalid.
You have shown that a some prefixes are not available in Crossref. That is true, but can you envision some way, even manual if feasible, to use to retrieve information about that publishers? Indeed, Crossref does not contain all data, and these are important anyway. It would be important to, somehow, enrich your data with this additional information, so as to have a clear picture of the situation.

DMP

Missing Citations in COCI: Publishers Analytics Result

Section 3.1.8: since you specified (in section 3.1.7) that you will use naming convention, you have to specify which naming convention is adopted.

Section 3.1.15: since we are referring to the output data you will create, I doubt that you will write it in Python Script File and Jupyter Notebooks. Please specify only the formats that are relevant for this dataset (which is not software, I believe).

Section 3.2.5: are you saying that the data you produced as output are also available in the repository dedicated to your software? If so, please say it explicitly.

Section 3.4.4: ISC license does not apply to dataset, only to software. Please, change it accordingly.

Code for Missing Citations Analysis in COCI

Section 2.1: in this context the term "data" should be interpreted as "software". Thus, the question refers to if you are going to reuse existing software for accomplishing your goal. Under this meaning, please be aware that also the Section 2.2 and Section 2.3 could be populated with some information. Please notice that this interpretation of "data" as "software" may affect also (the description of) other points of the DMP.

Section 3.1.15: since we are referring to the code in this part, I doubt that you will write it in CSV. Please specify only the formats that are relevant for this "dataset" (i.e. related to software).

Section 3.4.5: it seems that the description associated does not apply to software here. However, if you have developed some unit tests to check the correctness of your software, they can be considered as "documented procedures for quality assurance".

Protocol

"we got our input material from": the bibliografic reference cited inline is not clear. Please use a specific layout (e.g. italic) or format it in a different way so as to distinguish it from the text.

"extract_row_number(publisher_data)" (and similar): please format the text referring to code in a way which are immediately distinguishable from the other text.

"in the following way: [INSERT EXAMPLE DICT]": there is no dictionary specified there, only a placeholder.

Software

The README.md in the GitHub repository does not contain an appropriate introduction to the software and, in particular, how to and in which order to call the various scripts to run the process correctly. In addition, you should also specify with which configuration (i.e. computer, processor, RAM, HD, etc.) you have run the scripts to get your final output, since this is crucial to foster reproducibility. Finally, if the software is somehow related with other documents (the protocol, the article, website, etc.), please mention them here in the README.md.

Article

"Abstract": it should be the same structured abstract you have prepared in advance. Why here it has a different format?

"are “freely accessible and reusable”1": you add a footnote without adding the URL from which you took that quotation.

"is “a scholarly infrastructure organization...”2": there is no source for this quotation.

"since their DOI is invalid": the citations do not have a DOI, but the DOIs is held by the entities involved in the citation. Clarify better here, since it seems that the DOI is actually assigned to the citation.

"The first one is “Crowdsourcing open citations with CROCI An analysis of the current status of open citations, and a proposal” by Heibi, Peroni and Shotton" (and similar): you have organised all the citations by means of footnote, but I requested to use APA style for referencing and citing. Thus, please, modify all the citations so as to comply with that style - as an example, see how the other group have cited things within their text. Footnotes, in this case, should be used for adding additional material to the discussion and not to cite an article.

"This dataset cis structured": correct that "cis".

"Then, two variables (i.e. : start_index and prefix_to_name_dict)" (and similar): please, use a different style to show the code. In addition, if you are talking about a code, you must provide to the reader some excerpt of it that allows him/her to understand what you are saying. Please, avoid to put in the method section the whole protocol, but focus on the main step, and support the discussion with visuals that help the reader to understand the various passages of the steps followed.

"We are writing this article while the software hasn’t finished running yet": you should update the article with the final data and graphics.

"Figure 1" (and similar): the figures must be readable and contained within the margins. In case it is needed, put them (and only them) in a landscape view.

"Intersection between the first and the second question: auto citation": I think here you are referring to the concept of self-citation, e.g. see https://arxiv.org/abs/1903.06142.

"For the purposes of our research, the point was handled by locally creating a temporary JSON file to store information about each prefix and the publisher which it identifies": however, if I remember correctly, each publisher has a particular Crossref ID that characterises it (e.g. see the field "member" in a request for info about a DOI using the Crossref API). Why did you not use it, which is more reliable than the name?

Thank you for your suggestions. Before addressing the issues, we would like to clarify some aspects:

What do you exactly mean with “historical data” and “provenance information” in point 2 of the “General comments from the presentation”? Is the date of the validation enough or would you suggest to include also other information, for example related to possible metadata provided with the input file (i.e.: the provider of the input file, the date of creation of the input file) and/or with the publications the DOIs identify (e.g.: their dates of publication)?
For what concerns point 3 of the “General comments from the presentation”, is the issue related to the need of making other attempts to identify the publishers behind the prefixes not managed in Crossref? Or does it concern the possibility of enriching the data retrieved in general, for example with more detailed information about the identified publishers?
In order to implement the "procedures for quality assurance" mentioned in “Code for Missing Citations Analysis in COCI”, would it be appropriate to develop unittests based on manually generated test files, so as to test the actual capability of our software to correctly manage both validated and still invalid DOIs?

Hi @ariannamorettj,

Answers inline below.

* What do you exactly mean with “historical data” and “provenance information” in point 2 of the “General comments from the presentation”? Is the date of the validation enough or would you suggest to include also other information, for example related to possible metadata provided with the input file (i.e.: the provider of the input file, the date of creation of the input file) and/or with the publications the DOIs identify (e.g.: their dates of publication)?

I think there are at least two kinds of information that is needed to keep track of, in this case. The first one is, indeed, the date when you did the check if a certain DOI is valid or not. The second is the actual consequence of that check. For instance, it would be possible that either a DOI which was invalid is still invalid, a DOI which was invalid is now valid, a DOI which was valid is now invalid, and a DOI which was valid is still valid. Keeping track of the chain of these activities for each DOI forms what someone during the presentation mentioned as "historical data".

* For what concerns point 3 of the “General comments from the presentation”, is the issue related to the need of making other attempts to identify the publishers behind the prefixes not managed in Crossref? Or does it concern the possibility of enriching the data retrieved in general, for example with more detailed information about the identified publishers?

The point was indeed referring to the need of making attempts other than Crossref to identify the publishers associated with such prefixes. Such an approach can be computational if you find a way to get this information automatically, or entirely manual if you have to do it by hand.

* In order to implement the "procedures for quality assurance" mentioned in “Code for Missing Citations Analysis in COCI”, would it be appropriate to develop unittests based on manually generated test files, so as to test the actual capability of our software to correctly manage both validated and still invalid DOIs?

Yes, this is something that works well, indeed.

Thank you for your answer. We have an issue regarding this point:

Section 2.1: in this context the term "data" should be interpreted as "software". Thus, the question refers to if you are going to reuse existing software for accomplishing your goal. Under this meaning, please be aware that also the Section 2.2 and Section 2.3 could be populated with some information. Please notice that this interpretation of "data" as "software" may affect also (the description of) other points of the DMP.

Since we have used python libraries for developing our software, we would like to specify them and their source files in the point 2.3 of the description of this dataset (i.e., the software) in the DMP. However, the problem is that Argos doesn’t allow us to insert manually the link of the repositories where the source code of the libraries is stored (technically, it is indeed possible to insert manually the links, but after saving the specifications disappear). So we just selected the option “Python code” from the dropdown menu (instead of specifying manually the links) but we know that it is not precisely what we meant. Is it better to put only "Python code" although imprecise, or do you have other suggestions on how to handle this issue?

Since we have used python libraries for developing our software, we would like to specify them and their source files in the point 2.3 of the description of this dataset (i.e., the software) in the DMP.

My suggestion is to put all the descriptive information of all the links to the repo of the various software in the description of 2.1. Please note that the "Python Code" in 2.3 actually refers to a specific record in Zenodo (linked with GitHub) having title "Python Code" - it is not a generic name that anyone can assign as a description of the code you are using. Thus, in this case, my suggestion is to be as much complete as possible in 2.1.

Thank you for your answer. We will proceed in this way then, even if it means leaving point 2.3 blank.

Since we have used python libraries for developing our software, we would like to specify them and their source files in the point 2.3 of the description of this dataset (i.e., the software) in the DMP.

My suggestion is to put all the descriptive information of all the links to the repo of the various software in the description of 2.1. Please note that the "Python Code" in 2.3 actually refers to a specific record in Zenodo (linked with GitHub) having title "Python Code" - it is not a generic name that anyone can assign as a description of the code you are using. Thus, in this case, my suggestion is to be as much complete as possible in 2.1.

The Grasshopper team had the same doubt raised by @saroppini. We will also provide as much information as possible in field 2.1, leaving 2.3 blank, since none of the software reused has a Zenodo record linked with Github.

By closing the issue we add the links to the new versions of the project materials. These materials include the latests corrections we addressed to the Open Science project. Apart from the corrections proposed in the issue which emerged during the workshop, we have also addressed other tasks either came out during the workshop or which we have thought that could be useful in order to better contextualise the project. These further corrections are:

Sankey diagram in order to understand whether there are specific patterns (e.g. self-citations) in che relation between the first ten citing publishers for number of invalid citations and the first ten cited publishers (the diagram is available on both the website and the article);
The Sankey diagram has lead to the generation of a new piece of code in order to manipulate the data in the required way (available in the repository of the group together with the other pieces of code);
a bibliographic reference useful to identify the so called “major publishers”;
We have updated the website with the final graphics obtained from the final results of the software;
We used external APIs from mEDRA and DataCite to automatically extract the names of publishers with prefixes not recognised by Crossref;
We used Crossref member codes as keys to identify publishers with;
We specified the validation timing of each citational data (i.e. the time in which it has been validated or not);
We added the unit tests as a measure of quality assurance;
We implemented a regex pattern matching to spot the prefix in the DOIs resulted to be a suboptimal solution in the case of invalid receiving DOIs, which could actually be invalid because of an error in the prefix format. In order to optimise our results, we decided to implement an if-else structure so to try to match the prefix structure in the DOI before the first "/" character and maximise the chances of making a successful API request for the publisher's identification. Indeed, in this way we avoid wasting the possibility to spot the addressed publisher because of extra characters in the cited DOI. At the same time, we addressed the possibility that the prefix itself is not compliant in the else clause: in this latter case, we just stick to the the very first implementation of the prefix extraction, which was that of considering as prefix all the characters preceding the first "/" character. However, in this case, the publisher won't be identified (for the functions extract_publishers_valid and extract_publishers_invalid).

Hi @Alessia438,

As I anticipated in the issue, you have to

You have to address all of them and, once finalised, to close this issue with a comment containing your reply to each of the points I have highlighted.

Thus, please, extend your previous comment (which is indeed a great abstract presenting what you have done) by including a reply to each of my points – i.e. by quoting each of them and provide an explanation of how you have addressed it.

General comments from the presentation

The file names for input/output of the process are are embedded in the code. Could you change it so as to enable anyone to specify as input its favourite paths when running the scripts - e.g. by specifying some input parameter?

We rewrote some parts of our code in order to make it compliant with the current softcoding norms. In particular, in the very first correction phase, we just moved the input parameters specifications in the function invocation, in the software entry point, to separate it at least from the main function itself. Then, we decided to further improve the reusability of the code by giving the user the possibility to run the software from the command line, without the necessity to directly interact with the code at all. Accordingly, we added a usage tutorial in the README.md file, which explicitly explains how to call the main function with the chosen parameters, which in particular are supposed to be: An input CSV file with the same fields of the one used by us in the present study, but which could contain different citational data, have a different length, being a subset of the original one (e.g.: containing only the citational data which resulted invalid at a previous code run). An output JSON file, where the final computed data will be stored. A number of lines after which the processed information is stored in the cache file. Note that letting the user choose this last parameter results significantly important, since the necessity to save the processed data more or less often is also related to the performance of the system on which the code is run. For the creation of our study output dataset we opted for storing the data to cache files each 100 lines, which was a good tradeoff solution with respect to our necessity and our system’s settings and characteristics.

It would be important to keep track of historical data (a.k.a. provenance information), in order to check when a certain DOI become valid or invalid.

In order to keep track of the historical data, we decided to store the provenance information of each citation as the exact moment in which the datum was defined as valid or invalid. In particular, we opted for storing this information individually for each one of the citations instead of doing it once for the whole input file and saving it as an additional field of the final JSON because the time of a whole run of the code could be on the order of several days, specifically depending on many factors, among which the system’s settings and characteristics. So, even if it was a more expensive choice both in terms of memory and time, we decided to avoid to keep track of a unique validation time for the whole dataset, which would have been reasonably representative of the very last citations only, and very risky in terms of correctness for the very first processed citations.

You have shown that a some prefixes are not available in Crossref. That is true, but can you envision some way, even manual if feasible, to use to retrieve information about that publishers? Indeed, Crossref does not contain all data, and these are important anyway. It would be important to, somehow, enrich your data with this additional information, so as to have a clear picture of the situation.

Initially, we used to identify the prefix of a doi with anything preceding the first slash encountered in an identifier. However, since we realised that some DOIs came with incorrect structures, and sometimes with unforeseen characters in the prefix too, we implemented a regex pattern matching to spot the prefix structure in each DOI. By the way, also this approach resulted in a suboptimal solution in the case of invalid receiving DOIs (handled by the function extract_publishers_invalid), which could actually be invalid because of an error in the prefix format. In order to optimise our results, we decided to implement an if-else structure which tries to match the prefix structure in the DOI before the first "/" character and maximise the chances of making a successful API request for the publisher's identification. Indeed, in this way we avoid wasting the possibility to spot the addressed publisher because of extra characters in the cited DOI. At the same time, we addressed the possibility that the prefix itself is not compliant in the else clause: in this latter case, we just stick to the very first implementation of the prefix extraction, which was that of considering as prefix all the characters preceding the first "/" character. However, in this case, the publisher won't be identified.

DMP

Missing Citations in COCI: Publishers Analytics Result

Section 3.1.8: since you specified (in section 3.1.7) that you will use naming convention, you have to specify which naming convention is adopted.

Since there are no specific or universally required standards for naming conventions for JSON files (which is the format of our output dataset), we have arbitrarily chosen snake case as the naming convention for property names.

Section 3.1.15: since we are referring to the output data you will create, I doubt that you will write it in Python Script File and Jupyter Notebooks. Please specify only the formats that are relevant for this dataset (which is not software, I believe).

We have corrected that part by removing these two items and only leaving JSON format.

Section 3.2.5: are you saying that the data you produced as output are also available in the repository dedicated to your software? If so, please say it explicitly.

Now we made clear that we are referring to the output file. We have also specified in the DMP the URL of the Github repository of our project (https://github.com/open-sci/2020-2021-the-leftovers-20-code).

Section 3.4.4: ISC license does not apply to dataset, only to software. Please, change it accordingly.

In the end we selected the ISC 4.0 license for this dataset.

Code for Missing Citations Analysis in COCI

Section 2.1: in this context the term "data" should be interpreted as "software". Thus, the question refers to if you are going to reuse existing software for accomplishing your goal. Under this meaning, please be aware that also the Section 2.2 and Section 2.3 could be populated with some information. Please notice that this interpretation of "data" as "software" may affect also (the description of) other points of the DMP.

We re-used python libraries and already available materials in the main software:
json library (https://github.com/python/cpython/blob/3.9/Lib/json/__init__.py), csv library (https://github.com/python/cpython/blob/3.9/Lib/csv.py),
re library (https://github.com/python/cpython/blob/3.9/Lib/re.py),
os library (https://github.com/python/cpython/blob/3.9/Lib/os.py),
datetime (https://github.com/python/cpython/blob/3.9/Lib/datetime.py),
unit tests (https://github.com/python/cpython/blob/3.9/Lib/unittest/__init__.py)
and finally “requests” to deal with APIs requests.
Also we reused some javascript libraries in order to realise the graphics with the final data:
D3.js (https://d3js.org/d3.v7.min.js), Chart.js (https://cdn.jsdelivr.net/npm/chart.js), HighCharts (https://code.highcharts.com/highcharts.js).

Section 3.1.15: since we are referring to the code in this part, I doubt that you will write it in CSV. Please specify only the formats that are relevant for this "dataset" (i.e. related to software).

We only kept the Python Script File because we don’t write it either with CSV, Jupyter Notebook or JSON.

Section 3.4.5: it seems that the description associated does not apply to software here. However, if you have developed some unit tests to check the correctness of your software, they can be considered as "documented procedures for quality assurance".

We developed unit tests for checking our software functioning on the overall and in its various components. Unit Testing can be considered as a documented procedure for quality assurance of the data for this kind of dataset.

An important note about the final version of the DMP *: any missing answer in the DMP is attributable to system errors of the Argos platform. Before finalizing and publishing the final DMP, we carefully made sure that there was an exhaustive answer to every question. Unfortunately, once the DMP is finalized, the final pdf still lacks some answers to some important questions.

Protocol

"we got our input material from": the bibliographic reference cited inline is not clear. Please use a specific layout (e.g. italic) or format it in a different way so as to distinguish it from the text.

We corrected the format of the reference cited, as well as for the other references in the article.

"extract_row_number(publisher_data)" (and similar): please format the text referring to code in a way which are immediately distinguishable from the other text.

Given the limited formatting and layout options provided by the platform, we decided to visually differentiate code-related text by using bold characters (for example for variables, functions names…) and the italic for bibliographic references.

"in the following way: [INSERT EXAMPLE DICT]": there is no dictionary specified there, only a placeholder.

We corrected this part of the protocol by showing a subsample dictionary which could be representative of the organization of our output materials.

Software

The README.md in the GitHub repository does not contain an appropriate introduction to the software and, in particular, how to and in which order to call the various scripts to run the process correctly. In addition, you should also specify with which configuration (i.e. computer, processor, RAM, HD, etc.) you have run the scripts to get your final output, since this is crucial to foster reproducibility. Finally, if the software is somehow related with other documents (the protocol, the article, website, etc.), please mention them here in the README.md.

In order to address this point, we enriched the README.md file with all the required information. Accordingly, we added to this file a tutorial about the code launch from the command line and some natural language explanations to make clear which is the code entry point and - more generally - to enhance reproducibility for further reusers. For the same purpose, we structured this document in the subsequent points:

An introductory presentation of the contextualization of our study,
How to Run the Code, a brief tutorial which explains how to correctly run the code,
Hardware Configurations, specifying settings and characteristics of the system on which we run our code,
See Also, which contains all the links to our related resources.

Article

"are “freely accessible and reusable”1": you add a footnote without adding the URL from which you took that quotation. "is “a scholarly infrastructure organization...”2": there is no source for this quotation. "This dataset cis structured": correct that "cis". "since their DOI is invalid": the citations do not have a DOI, but the DOIs is held by the entities involved in the citation. Clarify better here, since it seems that the DOI is actually assigned to the citation. "Intersection between the first and the second question: auto citation": I think here you are referring to the concept of self-citation, e.g. see https://arxiv.org/abs/1903.06142.

We corrected these errors in the article.

"Abstract": it should be the same structured abstract you have prepared in advance. Why here it has a different format?

In the first place, we did not understand that the abstract of the article should be structured exactly as the abstract we had prepared in advance for the repository. Now we have corrected the abstract of the article to be the same structured abstract we had prepared in advance.

"The first one is “Crowdsourcing open citations with CROCI An analysis of the current status of open citations, and a proposal” by Heibi, Peroni and Shotton" (and similar): you have organised all the citations by means of footnote, but I requested to use APA style for referencing and citing. Thus, please, modify all the citations so as to comply with that style - as an example, see how the other group have cited things within their text. Footnotes, in this case, should be used for adding additional material to the discussion and not to cite an article.

We have modified all the citations (both inline and not inline) so as to comply with the APA style, as required.

"Then, two variables (i.e. : start_index and prefix_to_name_dict)" (and similar): please, use a different style to show the code. In addition, if you are talking about a code, you must provide to the reader some excerpt of it that allows him/her to understand what you are saying. Please, avoid to put in the method section the whole protocol, but focus on the main step, and support the discussion with visuals that help the reader to understand the various passages of the steps followed.

For what concerns the Subjects and Methods section, we welcomed the received suggestions and first of all we visually differentiated the text concerning the code by using Courier New font, in order to improve the visual quality of the article and to generally simplify the readers’ understanding of the explanations provided. Further, we restructured the whole software presentation by adopting a more divulgative and discursive explanatory approach. The code is not presented any more function by function, but its description is much more prone to present the various purposes which are addressed by the functions of each file in which the code is structured (i.e.: management of publishers’ identification, creation and compilation of the cache files, etc.) We also fostered the article comprehensibility by adding some visual support. In particular, we provided three images, representing:

A subsample dictionary which is aimed at representing the structure of our output.
An explanatory and very simplified scheme of the software structure, to present in a very general and divulgatory fashion what the various parts of the code are aimed at.
A Flow diagram of the software (which is also linked to its online pdf version, in order to improve the readability), which is aimed at giving an idea of how the main functions of the code interact.

"We are writing this article while the software hasn’t finished running yet": you should update the article with the final data and graphics.

The article was updated with the final data and graphics, so graphics are now different and the paragraphs commenting and explaining and discussing results have been rewritten.

"Figure 1" (and similar): the figures must be readable and contained within the margins. In case it is needed, put them (and only them) in a landscape view.

We have chosen to add the new figures of the new graphics and other relevant visuals contained within the margins.

"For the purposes of our research, the point was handled by locally creating a temporary JSON file to store information about each prefix and the publisher which it identifies": however, if I remember correctly, each publisher has a particular Crossref ID that characterises it (e.g. see the field "member" in a request for info about a DOI using the Crossref API). Why did you not use it, which is more reliable than the name?

In the end, we, indeed, used the member code as a way to identify and distinguish publishers. For this reason, we have corrected the article stating that the code now uses the field “member”.

More Corrections

Apart from the corrections proposed in the issue which emerged during the workshop, we have also addressed other tasks either came out during the workshop or which we have thought that could be useful in order to better contextualise the project. These further corrections are:

Sankey diagram in order to understand whether there are specific patterns (e.g. self-citations) in che relation between the first ten citing publishers for number of invalid citations and the first ten cited publishers (the diagram is available on both the website and the article);
The Sankey diagram has lead to the generation of a new piece of code in order to manipulate the data in the required way (available in the repository of the group together with the other pieces of code);
a bibliographic reference useful to identify the so called “major publishers”;
We have updated the website with the final graphics obtained from the final results of the software;
We used external APIs from mEDRA and DataCite to automatically extract the names of publishers with prefixes not recognised by Crossref;
We used Crossref member codes as keys to identify publishers with;
We added the unit tests as a measure of quality assurance.

open-sci / 2020-2021

Revision of material - team The Leftovers 2.0 #31

General comments from the presentation

DMP

Missing Citations in COCI: Publishers Analytics Result

Code for Missing Citations Analysis in COCI

Protocol

Software

Article

General comments from the presentation

DMP

Missing Citations in COCI: Publishers Analytics Result

Code for Missing Citations Analysis in COCI

Protocol

Software

Article

More Corrections