So far this PR has only been tested against my free AuraDB instance, which holds only a small subset of the nodes in the full database. I think it's necessary to test it against the full database (perhaps on rdip1) and spend some time making sure everything looks correct.
`update-neo4j.py` also needs to be tested. I have not done this yet (maybe Devon can do this real quick?).
Also, the case where an `Article` node has a single unique `Species` annotation and multiple redundant `Gene` annotations is not currently handled. I'm not sure how to determine which `Gene` relationship to keep (it probably depends on the `Species`?). In my Aura instance (21.2k nodes) this case never occurred, so it might not be worth handling.
Changes:
- `annotations.py`: Contains classes for creating new `PubtatorAnnotation` nodes, searching for duplicates, and eliminating duplicates. Most of the logic, including the rules for deciding whether nodes are duplicates, lives here.
- `update-neo4j.py`: Now uses the `AnnotationManager` class from `annotations.py` instead of a direct Cypher query to create new `PubtatorAnnotation` nodes, so duplicate annotations will no longer be added in the future.
- `annotation_dedup.py`: A one-time script for deduplicating the existing database. It only needs to be run once; it eliminates all existing duplicates and converts all `PubtatorAnnotation` nodes to the new type (with a `list[str]` `text` property instead of a plain `str`).
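
To make the `str` to `list[str]` change concrete, here is a minimal sketch of how duplicates could be folded into a single annotation. It is not the code in `annotations.py`: the `Annotation` dataclass, the `normalize_text` and `merge_duplicates` helpers, the sample rows, and the choice of `(article_id, entity_id)` as the duplicate key are all assumptions made for illustration. The real duplicate rules are the ones in `annotations.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in for a PubtatorAnnotation node."""
    article_id: str
    entity_id: str
    text: list[str] = field(default_factory=list)


def normalize_text(text):
    """Return text as a list of strings, accepting the old plain-str form too."""
    if isinstance(text, str):
        return [text]
    return list(text)


def merge_duplicates(rows):
    """Fold rows that share (article_id, entity_id) into one Annotation.

    The duplicate key used here is an assumption, not the project's rule.
    """
    merged = {}
    for article_id, entity_id, text in rows:
        key = (article_id, entity_id)
        ann = merged.setdefault(key, Annotation(article_id, entity_id))
        for t in normalize_text(text):
            if t not in ann.text:  # keep each surface form once, in first-seen order
                ann.text.append(t)
    return list(merged.values())


# Made-up sample rows: (article id, entity id, text)
rows = [
    ("PMID:1", "Gene:7157", "TP53"),
    ("PMID:1", "Gene:7157", "p53"),
    ("PMID:1", "Gene:7157", ["TP53"]),  # a node that was already converted
    ("PMID:2", "Gene:7157", "p53"),
]
result = merge_duplicates(rows)
```

In this sketch, rows whose `text` is already a list pass through `normalize_text` unchanged, so feeding converted and unconverted rows together gives the same merged result.
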
To run the one-time script `annotation_dedup.py`, make sure the credentials for the database you wish to update are entered in `config.ini`, then run `python3 annotation_dedup.py`. The script logs its progress as it goes and can be stopped and resumed at any time without damaging any data in the database.
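
As a rough illustration of the credentials step, the sketch below reads connection settings with Python's standard `configparser`. The `[neo4j]` section, the `uri`, `user`, and `password` keys, the example values, and the `load_credentials` helper are all assumptions; check the repository's `config.ini` for the names the scripts actually expect.

```python
import configparser

# Example contents only; section and key names are assumed, not taken from the repo.
EXAMPLE_CONFIG = """
[neo4j]
uri = neo4j+s://example.databases.neo4j.io
user = neo4j
password = change-me
"""

REQUIRED_KEYS = ("uri", "user", "password")


def load_credentials(config_text, section="neo4j"):
    """Parse INI text and return (uri, user, password), failing loudly if anything is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(section):
        raise KeyError(f"missing [{section}] section")
    missing = [k for k in REQUIRED_KEYS if not parser.has_option(section, k)]
    if missing:
        raise KeyError(f"missing keys in [{section}]: {', '.join(missing)}")
    return tuple(parser.get(section, k) for k in REQUIRED_KEYS)


uri, user, password = load_credentials(EXAMPLE_CONFIG)
```

With a real file, `parser.read("config.ini")` would take the place of `read_string`. Checking for missing keys up front means a wrong or incomplete `config.ini` fails before anything touches the database.
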
Separately, I'm wondering if we should organize the repository more cleanly. For example, `annotation_dedup.py`, `initial_loading.py`, and perhaps some future code (e.g. code to address #11) are strictly one-time scripts that do not really need to be maintained; we are only keeping them for documentation purposes. I think we could create a `scripts` folder to store them.
In contrast, `annotations.py`, `update-neo4j.py`, and perhaps some other files will be used regularly and may need to be maintained or updated in the future, so they might also be grouped together?
(can discuss later, not essential for this PR)
This PR addresses #3.
TODO:
- `annotation_dedup.py` on full database
- `update-neo4j.py`