openjournals / joss

The Journal of Open Source Software
MIT License

Archives for DOIs are set to the wrong values #1324

Closed sdruskat closed 2 months ago

sdruskat commented 3 months ago

Hi 👋,

Stumbled over two bad software_archives:

  1. For

It seems it has been inadvertently set to after it was previously correctly set to

  2. For

The Zenodo DOI has been truncated to The correct one is

Perhaps this can be fixed in the metadata?

sdruskat commented 3 months ago

Also, please let me know if

sdruskat commented 3 months ago

See above, put up as the only place I could find the offending strings in the GH org :).

sdruskat commented 3 months ago

Found another one (reported in openjournals/buffy#103 (comment)):

Fixed in

xuanxu commented 3 months ago

@sdruskat Thanks for reporting this! The easy way to correct the wrong values is for an EiC to regenerate the PDF and the metadata files. That can be done by reaccepting the paper, so the best place to report these things is the review issue of the affected papers. I'll ping the EiC for these two cases.

sdruskat commented 3 months ago

> the best place to report these things is the review issue of the affected papers.

Thanks for the pointer @xuanxu! I'll do this for future issues. 👍

> I'll ping the EiC for these two cases.

Three cases with the 🐝, and thanks!

sneakers-the-rat commented 3 months ago

Continuing this, I just went and checked all the archive links in joss-papers, and these are the papers that have problems:

| file | archive | status |
| --- | --- | --- |
| 10.21105.joss.00040.crossref.xml | | 500 |
| 10.21105.joss.00612.crossref.xml | (Missing link) | |
| 10.21105.joss.00971.crossref.xml | (Missing link) | |
| 10.21105.joss.02314.crossref.xml | | 500 |
| 10.21105.joss.04439.crossref.xml | | 404 |
| 10.21105.joss.04591.crossref.xml | | 404 |
| 10.21105.joss.04684.crossref.xml | | 404 |
| 10.21105.joss.05395.crossref.xml | | 410 |
| 10.21105.joss.05883.crossref.xml | | 404 |

not bad overall :)

here's the script (nothing special, just a one-off thing):

```python
"""
Check whether the archive DOI for each paper resolves to a page.

run this from within the joss-papers directory.

because of the handling of ratelimiting, you'll have to run this a few times
until you no longer skip for ratelimits.

generates
- `joss_archive_links.csv` - see `Results` for columns
- `joss_archive_links_clean.csv` - see `clean_csv`
- `joss_doi_pages` - xz compressed cache of the resolved archive pages

requires:
- requests
- tqdm
- pandas
"""
import csv
from xml.etree import ElementTree
from pathlib import Path
from dataclasses import dataclass, fields, asdict
from typing import Optional
import lzma
from multiprocessing import Pool, Event
from time import sleep, time
from math import ceil

import requests
from tqdm import tqdm
import pandas as pd

data_file = Path('joss_archive_links.csv')
cache_dir = Path('joss_doi_pages')

NAMESPACES = {
    'rel': ""
}


@dataclass
class Results:
    file: str
    archive: Optional[str] = None
    valid: bool = False
    status: Optional[int] = None
    error: Optional[str] = None
    retry_after: Optional[float] = None


def process_paper(path: Path) -> Optional[Results]:
    # skip papers whose archive page is already cached from an earlier run
    out_file = cache_dir / path.with_suffix('.html.xz').name
    if out_file.exists():
        return

    paper = ElementTree.parse(path).getroot()
    res = {}
    res['file'] = path.name
    try:
        archive = paper.find(".//rel:inter_work_relation[@relationship-type='references']", NAMESPACES).text
        archive = archive.lstrip('“').rstrip('”')
        if not archive.startswith('http'):
            archive = '' + archive
        res['archive'] = archive

        # hold if we are currently in a ratelimit cooldown.
        lock.wait()
        req = requests.get(res['archive'])
        res['status'] = req.status_code
        match res['status']:
            case 429:
                res['retry_after'] = float(req.headers['x-ratelimit-reset'])
            case 200:
                res['valid'] = True

        # cache every response except ratelimited ones, so those get retried
        if res['status'] != 429:
            with lzma.open(out_file, 'w') as cache_file:
                cache_file.write(req.content)
    except Exception as e:
        res['error'] = str(e)

    return Results(**res)


def init_lock(l):
    """make a lock (now an event) available as a global across processes in a pool"""
    global lock
    lock = l


def wait(lock: Event, result: Results, message: tqdm):
    """if we get a 429, acquire the lock until we can start again"""
    lock.clear()
    wait_time = ceil(result.retry_after - time())
    message.reset(wait_time)
    for i in range(int(wait_time)):
        sleep(1)
        message.update()
    lock.set()


def main():
    rate_lock = Event()
    rate_lock.set()
    cache_dir.mkdir(exist_ok=True)
    # ya i know i ruin the generator but i like progress bars with totals
    files = list(Path('.').glob("joss*/*crossref.xml"))
    try:
        all_pbar = tqdm(total=len(files), position=0)
        good = tqdm(position=1)
        bad = tqdm(position=2)
        message = tqdm(position=3)

        pool = Pool(16, initializer=init_lock, initargs=(rate_lock,))

        if not data_file.exists():
            with open(data_file, 'w', newline='') as dfile:
                writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
                writer.writeheader()

        with open(data_file, 'a', newline='') as dfile:
            writer = csv.DictWriter(dfile, [field.name for field in fields(Results)])
            for result in pool.imap_unordered(process_paper, files):
                all_pbar.update()
                if result is None:
                    continue
                if result.retry_after:
                    wait(rate_lock, result, message)
                if result.valid:
                    good.update()
                else:
                    bad.update()
                writer.writerow(asdict(result))
    finally:
        all_pbar.close()
        good.close()
        bad.close()

    clean_csv()


def clean_csv(path: Path = data_file):
    """
    - remove 429s
    - deduplicate rows (if identical)
    - sorts by `valid` and then `file`
    """
    df = pd.read_csv(path)
    df = df.loc[df['status'] != 429]
    df = df.drop_duplicates()
    df = df.sort_values(['valid', 'file'], ignore_index=True)
    out_fn = (path.parent / (path.stem + '_clean')).with_suffix('.csv')
    df.to_csv(out_fn, index=False)


if __name__ == "__main__":
    main()
```

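The "(Missing link)" rows in the table are files with no archive relation at all, which is a different failure from an archive URL that returns an error status. A minimal sketch of how that case could be told apart before any request is made, assuming the same `rel:inter_work_relation` lookup as the script above. `find_archive` is a hypothetical helper, and the namespace URI and DOI in the example are placeholders rather than real values:

```python
from typing import Optional
from xml.etree import ElementTree

# Same XPath the script uses to locate the archive relation.
XPATH = ".//rel:inter_work_relation[@relationship-type='references']"


def find_archive(xml_text: str, namespaces: dict) -> Optional[str]:
    """Return the archive identifier, or None if the relation is absent or empty."""
    root = ElementTree.fromstring(xml_text)
    node = root.find(XPATH, namespaces)
    if node is None or not (node.text or "").strip():
        return None
    # Same cleanup as the script: drop stray curly quotes around the value.
    return node.text.strip().lstrip('“').rstrip('”')


# Placeholder namespace URI, for illustration only (an assumption).
NS = {"rel": "urn:example:relations"}

# Made-up documents: one with an archive relation, one without.
with_link = (
    '<doc xmlns:rel="urn:example:relations"><rel:program><rel:related_item>'
    '<rel:inter_work_relation relationship-type="references" identifier-type="doi">'
    '“10.5281/zenodo.0000000”</rel:inter_work_relation>'
    '</rel:related_item></rel:program></doc>'
)
without_link = '<doc xmlns:rel="urn:example:relations"><rel:program/></doc>'

print(find_archive(with_link, NS))     # the cleaned identifier
print(find_archive(without_link, NS))  # None, i.e. a "missing link" row
```

Returning `None` instead of letting `.text` raise on a missing element would keep "no archive recorded" separate from request errors in the `error` column.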
xuanxu commented 2 months ago

Closing this issue as PDFs and metadata on the three papers have been corrected and re-deposited.

arfon commented 2 months ago

Quick update:


Now fixed.


Is a 410 which looks to be some kind of "User was blocked" thing. I'm not sure what to do about this one.


Looks like there was some kind of error with the reaccept compilation here. @xuanxu – any ideas what is going on there?


Now fixed.


Looks like the paper is missing. I've asked the author to re-add it:


Seems to resolve for me now?


Looks like it's missing from the PDF and the XML files? This probably needs manual handling.


Same issue as 10.21105.joss.00971.crossref.xml. It's missing from the paper and the Crossref XML but the DOI is resolving.


I think we should report this to Zenodo as an issue.