molgenis / vibe

Variant Interpretation using Biomedical literature Evidence
GNU Lesser General Public License v3.0

New database containing invalid HGNC symbols. #82

Open svandenhoek opened 3 years ago

svandenhoek commented 3 years ago

Describe the bug

The new VIBE v5.1 database contains gene symbols that are assumed to be HGNC gene symbols, but this is not always the case. The issue also seems to be present in the source dataset, where these symbols are described as if they were HGNC symbols (e.g. http://identifiers.org/hgnc.symbol/MCS+9.7, where the rdfs:comment states it is an HGNC Gene Symbol).

To Reproduce

$ java -jar vibe-with-dependencies-5.1.4.jar -t vibe-5.1.0-hdt/vibe-5.1.0.hdt -w hp.owl -p HP:0002664
######## ######## ########
An unexpected exception occurred. Please notify the developer (see https://github.com/molgenis/vibe) and supply the text as seen below.
######## ######## ########
org.molgenis.vibe.core.exceptions.InvalidStringFormatException: hgnc:MCS+9.7 does not adhere the required format: ^(hgnc|HGNC):([a-zA-Z0-9#@/._-]+)$
    at org.molgenis.vibe.core.formats.Entity.retrieveIdFromString(Entity.java:131)
    at org.molgenis.vibe.core.formats.Entity.<init>(Entity.java:99)
    at org.molgenis.vibe.core.formats.GeneSymbol.<init>(GeneSymbol.java:36)
    at org.molgenis.vibe.core.database_processing.GenesForPhenotypeRetriever.retrieveData(GenesForPhenotypeRetriever.java:66)
    at org.molgenis.vibe.core.database_processing.GenesForPhenotypeRetriever.run(GenesForPhenotypeRetriever.java:39)
    at org.molgenis.vibe.core.GeneDiseaseCollectionRetrievalRunner.call(GeneDiseaseCollectionRetrievalRunner.java:31)
    at org.molgenis.vibe.cli.RunMode.retrieveDatasetOutput(RunMode.java:72)
    at org.molgenis.vibe.cli.RunMode.access$100(RunMode.java:23)
    at org.molgenis.vibe.cli.RunMode$4.runMode(RunMode.java:57)
    at org.molgenis.vibe.cli.RunMode.run(RunMode.java:125)
    at org.molgenis.vibe.cli.VibeApplication.executeRunMode(VibeApplication.java:59)
    at org.molgenis.vibe.cli.VibeApplication.main(VibeApplication.java:27)

Expected behavior

Results are shown, and the data either contains no invalid HGNC symbols or does not explicitly mark such symbols as HGNC symbols.

svandenhoek commented 3 years ago

When fixing, ensure tests are added on one of the failing HPO-terms (such as HP:0002664).

svandenhoek commented 3 years ago

DisGeNET will evaluate how to remove incorrect entries (that is, gene symbols that aren't HGNC symbols but do have an http://identifiers.org/hgnc.symbol/ IRI in DisGeNET). A hotfix will be implemented so that the current version of VIBE works with the updated database, though this means that, for now, the gene symbols shown aren't always HGNC-approved symbols. When a new DisGeNET database release is available, a more long-term fix should be implemented based on the new RDF design (depending on how non-official gene symbols are stored within DisGeNET at that point).
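As a rough illustration of what such a lenient approach could look like (this is not the actual change, just a sketch): the strict pattern is copied from the exception message, while the class `LenientGeneSymbol`, the relaxed pattern, and the `matchesStrictFormat` flag are hypothetical names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not VIBE code: accept any non-whitespace id after the
// prefix and only record whether it also fits the strict format.
class LenientGeneSymbol {
    // Strict pattern copied from the exception message in this issue.
    private static final Pattern STRICT =
            Pattern.compile("^(hgnc|HGNC):([a-zA-Z0-9#@/._-]+)$");
    // Assumed relaxed pattern: prefix followed by one or more non-whitespace characters.
    private static final Pattern LENIENT =
            Pattern.compile("^(hgnc|HGNC):(\\S+)$");

    final String symbol;
    // True only if the input fits the strict format; this does NOT prove
    // that the symbol is an HGNC-approved symbol.
    final boolean matchesStrictFormat;

    LenientGeneSymbol(String prefixedId) {
        Matcher lenient = LENIENT.matcher(prefixedId);
        if (!lenient.matches()) {
            throw new IllegalArgumentException(
                    prefixedId + " does not match " + LENIENT.pattern());
        }
        this.symbol = lenient.group(2);
        this.matchesStrictFormat = STRICT.matcher(prefixedId).matches();
    }

    public static void main(String[] args) {
        LenientGeneSymbol s = new LenientGeneSymbol("hgnc:MCS+9.7");
        System.out.println(s.symbol + " strict=" + s.matchesStrictFormat);
    }
}
```

Keeping the flag separate from the symbol would let output code decide later whether to show or mark symbols that fall outside the strict format.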

svandenhoek commented 3 years ago

Non-valid HGNC symbols seem to be present in previous releases as well, for example http://identifiers.org/hgnc.symbol/LOC105375655 in the DisGeNET v6.0.0 geneSymbol.ttl. It is therefore very likely that every VIBE release is affected by this bug, but no error was thrown because the earlier non-valid HGNC symbols happened to adhere to the validation regex.
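To illustrate why older releases did not fail, here is a minimal standalone sketch that applies the regex from the exception message above to both symbols. The class and method names are made up for this example and are not part of VIBE.

```java
import java.util.regex.Pattern;

// Standalone illustration; HgncRegexCheck and isAccepted are hypothetical names.
class HgncRegexCheck {
    // Pattern copied verbatim from the InvalidStringFormatException message.
    static final Pattern HGNC_ID =
            Pattern.compile("^(hgnc|HGNC):([a-zA-Z0-9#@/._-]+)$");

    // Returns true if the prefixed id fits the validation regex.
    static boolean isAccepted(String prefixedId) {
        return HGNC_ID.matcher(prefixedId).matches();
    }

    public static void main(String[] args) {
        // Not an HGNC symbol, but only letters and digits, so the regex accepts it.
        System.out.println(isAccepted("hgnc:LOC105375655")); // true
        // Contains '+', which is not in the character class, so it is rejected.
        System.out.println(isAccepted("hgnc:MCS+9.7"));      // false
    }
}
```

So the regex only checks which characters are used, not whether a symbol is actually HGNC approved.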

svandenhoek commented 3 years ago

Hotfix implemented in #85 for the 5.1 branch; #86 should merge this back into master. The core of this issue still requires an updated database.