dhimmel opened this issue 5 years ago
Thank you indeed for this study and for sharing the code and data. I have some questions about the data sources too.
The study doesn't seem to differentiate between deposits that were preprints versus those that were postprints.
I agree it would be nice to surface the distinction. An easy first step might be to use the median lag rather than the mean. In subjects like maths, where the deposit lag turned out to be conspicuously negative, it feels like we're mostly measuring how slow the journals were. If the positive and negative lags didn't cancel each other out, we might have a clearer picture. (But I'm not sure I understand the meaning of the negative lags correctly: it seems odd to have so many publications with a lag of < –2000 days.)
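To make the idea concrete, here is a minimal sketch with toy numbers (the column names are just placeholders, not the paper's actual schema):

```python
import pandas as pd

# Toy numbers only; the column names are my invention, not the paper's schema.
lags = pd.DataFrame({
    "subject": ["Mathematics"] * 4 + ["Medicine"] * 4,
    "deposit_lag_days": [-2400, -300, 20, 45,   # arXiv-style early deposits
                         10, 60, 200, 400],     # post-publication deposits
})

summary = lags.groupby("subject")["deposit_lag_days"].agg(["mean", "median"])
print(summary)
# The mean is dragged far negative by a few extreme preprint lags;
# the median stays much closer to the typical deposit behaviour.
```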
However, I think the preprint vs. postprint distinction is much less important than the OA vs. not OA distinction. This passage struck me:
For this initial study we make the assumption that if a metadata record is in the repository, the full text is also deposited. This is because validating if the full text is deposited is a complicated […]
I think this is a giant leap. In Italian institutional repositories, the best cases have 10-15% of records with a full text, while the average seems to be less than 7%: https://www.base-search.net/Search/Results?lookfor=doctype%3A1*+country%3Ait (or roughly twice that in the 2013-2017 period).
It's not clear to me how the data was narrowed down to a set that allows one to say something about open access status and compliance, rather than just about how comprehensive the librarians were in cataloguing the publications in their institutional repository (for productivity assessment or other purposes). At the end of section 4 you state that you went from some 15M rows to 1.5M (of which 800k had a deposit date), but I didn't catch whether that was just by virtue of the publication date cutoff (from the dataset it looks that way), or by countries, or something else.
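For instance, a check along these lines (the file name, column names, and cutoff date are only guesses at the real schema) would show how much of the reduction the date cutoff alone explains:

```python
import pandas as pd

# File name, column names and cutoff date are guesses; adjust to the released schema.
rows = pd.read_csv("jcdl_2019_dataset.csv.gz",
                   parse_dates=["published_date", "deposited_date"])

after_cutoff = rows[rows["published_date"] >= "2013-01-01"]
with_deposit = after_cutoff[after_cutoff["deposited_date"].notna()]

print(f"all rows:            {len(rows)}")
print(f"after date cutoff:   {len(after_cutoff)}")
print(f"with a deposit date: {len(with_deposit)}")
```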
Still on alternative sources:
neither BASE nor OpenAIRE make the datasets publicly available for download and analysis. Furthermore, judging from the user interfaces of both, deposit dates do not appear to be available.
Having a CC-0 dump is a strong advantage of CORE indeed. As for the dates of retrieval, however, I thought they were available at least from the RSS feeds, and probably in the datestamp field of the OAI-PMH interface: http://oai.base-search.net/#oai-dc
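For instance, something like the following should surface the datestamps, assuming the interface behaves like standard OAI-PMH (the exact endpoint URL is my guess from the documentation page above, and this only fetches the first page of results):

```python
import requests
import xml.etree.ElementTree as ET

# Endpoint guessed from the documentation page linked above; not verified here.
OAI_ENDPOINT = "http://oai.base-search.net/oai"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

resp = requests.get(OAI_ENDPOINT,
                    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

# Each OAI-PMH record header carries a datestamp (when the record was added/changed).
for record in root.findall(".//oai:record", NS):
    identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
    datestamp = record.findtext("oai:header/oai:datestamp", namespaces=NS)
    print(identifier, datestamp)
```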
Unpaywall data may be the easiest comparison point. Their unpaywall_snapshot_2018-09-24T232615.jsonl.bz2 has about 24M DOIs with an OA fulltext URL. From your dataset I find 530705 DOIs, which go down to 529692 once cleaned with the sed script at https://doi.org/10.5281/zenodo.997221; of these, "only" 246664 have an OA URL whatsoever (including gold OA) in the Unpaywall dump.
Attached: jcdl_2019_dataset.dois.oa.csv.gz, jcdl_2019_dataset.dois.csv.gz
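The comparison can be reproduced roughly like this (the "doi" column name and the best_oa_location field are my assumptions about the respective file formats):

```python
import bz2
import json

import pandas as pd

# The "doi" column name is an assumption; adjust to the actual CSV header.
dataset_dois = set(pd.read_csv("jcdl_2019_dataset.dois.csv.gz")["doi"].str.lower())

oa_dois = set()
with bz2.open("unpaywall_snapshot_2018-09-24T232615.jsonl.bz2", "rt") as fh:
    for line in fh:
        record = json.loads(line)
        # "best_oa_location" is the field name I recall from that snapshot format;
        # it is null when Unpaywall knows of no OA copy.
        if record.get("best_oa_location"):
            oa_dois.add(record["doi"].lower())

print(f"{len(dataset_dois & oa_dois)} of {len(dataset_dois)} DOIs "
      f"have an OA URL in the Unpaywall dump")
```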
I provided comments on this study to journalist Dalmeet Chawla, prior to its publication, for a news piece in Physics Today. I've copied my comments below. Note that my comments about the code not being available at the time of my review are not relevant to the published paper: the GitHub repository has been updated and the data archive posted to Zotero.
I found the study innovative and promising as a systematic method for evaluating trends in repository deposit times. However, given the available data and methodology, I consider the findings very preliminary and subject to uncontrolled factors. My unabridged comments are below (also cross-posted on Publons):
I think systematically evaluating deposit time lags is interesting.
Integrating the Crossref and CORE catalogs could create a useful dataset for many analyses. Currently, the study's code repository and data archive are blank / unreleased. Therefore, I cannot evaluate whether the integrated data from this study will be reusable, but if released and documented properly following publication, I think the code/data could be a great resource. Specifically, the "any repository" deposit date is helpful to indicate when a study was first deposited. I also thought the method for using Mendeley readership to assign disciplines to articles was innovative.
Regarding the main question, whether deposit lag times have decreased, I have a few comments:
As the authors note, average deposit time is biased unless all articles have had equal time to accumulate deposits. For this reason, I would avoid placing much emphasis on analyses that did not limit deposits to a standard window following publication for all years being reported on. Therefore, I would focus on Figure 8 and Figure 15 rather than Figure 7 to determine whether deposit times have changed. Figure 15 is probably the most reliable with its two-year window. Figure 15 shows that lag time decreased for articles deposited in Italian and UK repositories, but not so much for the other countries. Since I don't believe Figure 9, Figure 10, Figure 11, Figure 12, or Figure 13 use a fixed deposit window, I question their reliability.
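For concreteness, a minimal sketch of what I mean by a fixed window, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical file and column names; the point is to give every publication year
# the same amount of time to accumulate deposits before comparing lags.
WINDOW_DAYS = 730  # e.g. the two-year window of Figure 15

df = pd.read_csv("deposits.csv", parse_dates=["published_date", "deposited_date"])
df["lag_days"] = (df["deposited_date"] - df["published_date"]).dt.days

# Keep only deposits made within the window; publication years without a full
# two years of follow-up should also be excluded before comparing.
windowed = df[df["lag_days"].between(0, WINDOW_DAYS)].copy()
windowed["pub_year"] = windowed["published_date"].dt.year
print(windowed.groupby("pub_year")["lag_days"].mean())
```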
There is the possibility of a large selection bias because the study only considers articles that were deposited. For example, imagine 100 articles are produced which all have a funder mandate to be deposited. In scenario A, 10 of the articles are deposited 30 days following publication, while the remaining 90 are never deposited. In scenario B, 80 of the articles are deposited 90 days following publication, while the remaining 20 are never deposited. Scenario A would have a smaller average deposit time lag (30 versus 90 days), but scenario B is preferable from the perspective of the funder.
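A toy calculation makes the trade-off explicit:

```python
# Toy numbers from the two scenarios above: deposit lags (in days) for the articles
# that were actually deposited, out of 100 mandated articles per scenario.
scenario_a = [30] * 10   # 10 deposited at 30 days, 90 never deposited
scenario_b = [90] * 80   # 80 deposited at 90 days, 20 never deposited

for name, lags in [("A", scenario_a), ("B", scenario_b)]:
    mean_lag = sum(lags) / len(lags)
    compliance = len(lags) / 100
    print(f"Scenario {name}: mean lag {mean_lag:.0f} days, "
          f"{compliance:.0%} of mandated articles deposited")
# Scenario A looks better on mean lag (30 vs 90 days) but deposits only 10% of
# articles; scenario B deposits 80%, which is what the mandate actually cares about.
```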
The study doesn't seem to differentiate between deposits that were preprints versus those that were postprints. While funder policies vary, I imagine most policies are interested in the final version of a manuscript being deposited rather than a precursor preprint. Note that Figure 12 shows negative average deposit time lags for 2017 articles in Math and Physics & Astronomy, two disciplines which preprint extensively on arXiv. arXiv is also the top repository in Table 3. Accordingly, to what extent is the decrease in deposit lag times driven by increased preprinting? Would the results change if only considering deposits that were content-identical to their published version?
CORE apparently contains open access articles from both repositories (e.g. pre/post-prints) and journals. It wasn't clear to me whether the authors excluded CORE records from journals. If not, could the deposit date refer to the same journal publication event as the Crossref date?
The authors did ignore articles whose Crossref publication date consisted of just a year without a month or day. However, I don't believe they address the high prevalence of Crossref publication dates that are set to the first of the month or year, as I visualize here. In fact, the authors set articles without a publication day to the first of the month. I suspect, therefore, that for many articles the effective lag time precision is calendar months rather than days. Perhaps, therefore, it would be more accurate to present results as monthly lag times (for example, as we've done here in a different context).
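For illustration, a month-level lag could be computed along these lines (a sketch, not the authors' method):

```python
from datetime import date

def lag_in_months(published: date, deposited: date) -> int:
    """Lag in whole calendar months, ignoring the often-unreliable day component."""
    return (deposited.year - published.year) * 12 + (deposited.month - published.month)

# A Crossref date defaulted to the first of the month vs. an actual deposit date:
print(lag_in_months(date(2017, 3, 1), date(2017, 9, 15)))  # 6 months
```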
Based on these considerations, I think the study provides intriguing early evidence that deposition lag times may have decreased. However, further, more controlled analysis would be required to attest to this finding definitively.
On a minor note, I was amused by the finding that "the date of acceptance in Crossref is in 99.9% of cases not available and in 98% of cases where it is available, it is incorrect." I'd be interested to see Crossref's documentation provided to publishers regarding this field. Publishers are able to deposit acceptance dates to PubMed at higher fidelity, so I'm curious what is different about Crossref. Perhaps this metadata field was recently added or is not well documented?
P.S. Thanks for making your code and data open. I always enjoy being able to engage in the scientific process through forums like GitHub Issues! Cheers, Daniel.