protegeproject / protege

Protege Desktop
http://protege.stanford.edu
Other
971 stars 229 forks source link

Do not allow sub-string matching when recognizing OBO IDs. #1100

Closed gouttegd closed 1 year ago

gouttegd commented 1 year ago

The OboUtilities.isOboId method in org.protege.editor.owl.model.util is currently using the Pattern.find method to check whether its argument is an OBO ID.

This is inappropriate as the Pattern.find method looks for a subsequence within the string, instead of attempting to match the full string. As a result, a bogus OBO ID such as "MONDO:0018299-obsoleted" will be incorrectly identified as a valid OBO ID because the Pattern.find method will simply stop matching at the first character that does not fit the pattern (the '-'), but will still report that it has find a match.

Here, we need to check that the entire string matches the OBO ID pattern, which is why this commit replaces the use of Pattern.find by Pattern.match, which does precisely that.

This is also what the getOboLibraryIriFromOboId method, in the same class, is doing. For consistency, isOboId should do the same, so that when isOboId returns true, the calling code can then invoke getOboLibraryIriFromOboId on the same string and be confident that the call would not throw a "Invalid OBO Id" RuntimeException.

gouttegd commented 1 year ago

This is actually not quite right in general, as it does not catch cases like PRO, or OBA, correctly, where (\d+) is not enough - it should be ([A-Z0-9]+) instead.

The pattern is correct for what the OBO specification calls Canonical Prefixed ID, where only digits are allowed in the “local” part.

The pattern will indeed not recognise Non-Canonical Prefixed ID, where almost everything is allowed apart from whitespaces (not just [A-Z0-9]+). I assume this is what is intended, so will leave that pattern as it is.

jamesaoverton commented 1 year ago

@gouttegd is right about the official rules. In practise, these are the exceptions to the OBO numeric ID policy that the PURL system is aware of: PR's adopted UniProt IDs, NCIT's C\d+ concept IDs, RO_HOM\d+, and GNO. I don't know the OBA case the @matentzn mentioned. The PR and NCIT are significant enough that you might consider making exceptions for them.

https://github.com/OBOFoundry/purl.obolibrary.org/blob/master/config/obo.yml#L110