identifiers-org / identifiers-org.github.io

Inconsistent patterns in several entries #151

Open cthoyt opened 3 years ago

cthoyt commented 3 years ago

This issue points out several entries in identifiers.org whose patterns are malformed: they lack a ^ at the beginning of the pattern, a $ at the end, or both. There are a few cases where this actually makes sense, but they are uncommon. The inconsistency makes it difficult to rely on a given regular expression pattern, which is in some ways necessary to deal with embedded LUIs. In the table below, an x marks a missing anchor.
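To illustrate why the missing anchors matter, here is a minimal sketch using two patterns copied from the table below. The helper name `found` and the sample strings are made up for illustration; the sketch assumes a consumer applies the pattern with Python's `re.search`, which may not be how any particular resolver does it.

```python
import re

# Patterns copied verbatim from the table in this issue.
MOD_PATTERN = r"^MOD:\d{5}"  # has ^ but no $
ECO_PATTERN = r"ECO:\d{7}$"  # has $ but no ^


def found(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# A well-formed identifier is accepted, as expected.
print(found(MOD_PATTERN, "MOD:00001"))        # True
# Without $, trailing characters are silently accepted.
print(found(MOD_PATTERN, "MOD:00001junk"))    # True
# Without ^, leading characters are silently accepted.
print(found(ECO_PATTERN, "junkECO:0000001"))  # True
```

With both anchors in place, the second and third calls would return False, which is what code that validates or extracts LUIs would normally expect.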

prefix             pattern                                                     ^   $
reactome           (^R-[A-Z]{3}-\d+(-\d+)?(\.\d+)?$)|(^REACT_\d+(\.\d+)?$)     x   x
signaling-gateway  A\d{6}$                                                     x
eco                ECO:\d{7}$                                                  x
mod                ^MOD:\d{5}                                                      x
wikipathways       WP\d{1,5}(\_r\d+)?$                                         x
mirbase            MI\d{7}                                                     x   x
obi                (^OBI:\d{7}$)|(^OBI_\d{7}$)                                 x   x
uo                 ^UO:\d{7}?                                                      x
pmc                PMC\d+                                                      x   x
cryptodb           ^\w+                                                            x
tritrypdb          ^\w+(\.)?\w+(\.)?\w+                                            x
pmdb               ^PM\d{7}                                                        x
mirbase.mature     MIMAT\d{7}                                                  x   x
nextprot           ^NX_\w+                                                         x
worfdb             ^\w+(\.\d+)?                                                    x
sdbs               \d+$                                                        x
tarbase            ^[a-z]{3}\-(mir|let|lin)\-\w+(\-\w+\-\w+)?                      x
inchikey           ^[A-Z]{14}\-[A-Z]{10}(\-[A-Z])?                                 x
affy.probeset      \d{4,}((_[asx])?_at)?                                       x   x
pocketome          ^[A-Za-z_0-9]+                                                  x
ndc                ^\d+\-\d+\-\d+                                                  x
dailymed           ^[A-Za-z0-9-]+                                                  x
lincs.cell         (^LCL-\d+$)|(^LDC-\d+$)|(^ES-\d+$)|(^LSC-\d+$)|(^LPC-\d+$)  x   x
d1id               \S+                                                         x   x
vmhmetabolite      [a-zA-Z0-9_\(\_\)\[\]]+                                     x   x
vmhreaction        [a-zA-Z0-9_\(\_\)\[\]]+                                     x   x
ocid               ocid:[0-9]{12}                                              x   x
nemo               [a-z]{3}-[a-km-z0-9]{7}                                     x   x
minid.test         [A-Za-z0-9]+$                                               x
transyt            T[A-Z]\d{7}                                                 x   x
knapsack           ^C\d{8}                                                         x
biosimulators      [a-zA-Z0-9-_]+                                              x   x
vmhgene            ^[0-9]+\.[0-9]+                                                 x
runbiosimulations  [0-9a-z]{24,24}                                             x   x

This table was generated with the following Python code:

import requests
from tabulate import tabulate

#: see https://docs.identifiers.org/articles/api.html#getdataset
URL = 'https://registry.api.identifiers.org/resolutionApi/getResolverDataset'

def main():
    # Fetch the full registry dataset and fail loudly on HTTP errors.
    res = requests.get(URL)
    res.raise_for_status()
    rows = []
    for entry in res.json()['payload']['namespaces']:
        pattern = entry['pattern']
        # Purely textual check: does the pattern start with ^ and end with $?
        has_caret = pattern.startswith('^')
        has_dollar = pattern.endswith('$')
        if not has_caret or not has_dollar:
            # An 'x' marks a missing anchor.
            rows.append((
                entry['prefix'],
                pattern,
                '' if has_caret else 'x',
                '' if has_dollar else 'x',
            ))
    print(tabulate(rows, headers=['prefix', 'pattern', '^', '$'], tablefmt='html'))

if __name__ == '__main__':
    main()
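The script above checks only the first and last character of each pattern, so it also flags alternations such as the obi, reactome, and lincs.cell entries, where every branch carries its own ^ and $. These may be among the cases that "make sense". The following is a rough sketch of a branch-aware check, not part of the original script: the function names are hypothetical, and the parsing is deliberately simple (it does not handle parentheses inside character classes, non-capturing groups, or other advanced syntax).

```python
def split_top_level(pattern: str) -> list:
    """Split a regex on '|' characters that sit outside groups, classes, and escapes."""
    parts, current = [], []
    depth, in_class, i = 0, False, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":  # keep an escape and its target together
            current.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def unwrap(branch: str) -> str:
    """Drop one pair of parentheses if it encloses the entire branch."""
    if not (branch.startswith("(") and branch.endswith(")")):
        return branch
    depth, i = 0, 0
    while i < len(branch):
        ch = branch[i]
        if ch == "\\":
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i != len(branch) - 1:
                return branch  # first group closes early, e.g. "(a)(b)"
        i += 1
    return branch[1:-1] if depth == 0 else branch


def ends_with_dollar(branch: str) -> bool:
    """True if the branch ends with an unescaped $."""
    if not branch.endswith("$"):
        return False
    body = branch[:-1]
    backslashes = len(body) - len(body.rstrip("\\"))
    return backslashes % 2 == 0


def is_anchored(pattern: str) -> bool:
    """True if every top-level branch starts with ^ and ends with $."""
    branches = [unwrap(b) for b in split_top_level(pattern)]
    return all(b.startswith("^") and ends_with_dollar(b) for b in branches)


print(is_anchored(r"(^OBI:\d{7}$)|(^OBI_\d{7}$)"))  # True
print(is_anchored(r"PMC\d+"))                        # False
print(is_anchored(r"^MOD:\d{5}"))                    # False
```

Under these assumptions, the three alternation entries would drop out of the table, while the remaining rows would still be reported.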