ncats / RDAS


Add annotation deduplication implementation #23

Closed brandon-gong closed 1 year ago

brandon-gong commented 2 years ago

This PR addresses #3.

So far it has only been tested against my free AuraDB instance, which holds only a small subset of the nodes in the full database. I think it's necessary to test it against the full database (perhaps on rdip1) and spend some time making sure everything looks correct.

Also, the case where an Article node has a single unique Species annotation and multiple redundant Gene annotations is not currently handled. I'm not sure how to determine which Gene relationship to keep (it would probably depend on the Species?). In my Aura instance (21.2k nodes) this case never occurred, so it might not even be worth handling.
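To decide whether this case is worth handling, it could be counted on the full database first. The following is a minimal in-memory sketch of that check, not code from this PR: the function name find_unhandled_articles and the (article_id, annotation_type, annotation_value) row layout are hypothetical, and "redundant" is assumed to mean more than one Gene annotation on the same Article.

```python
from collections import defaultdict


def find_unhandled_articles(rows):
    """Return sorted IDs of articles that have exactly one distinct
    Species annotation and more than one Gene annotation.

    rows: iterable of (article_id, annotation_type, annotation_value)
    tuples. This is a hypothetical flattened export, not the real schema.
    """
    species = defaultdict(set)      # article_id -> distinct Species values
    gene_counts = defaultdict(int)  # article_id -> number of Gene annotations

    for article_id, ann_type, value in rows:
        if ann_type == "Species":
            species[article_id].add(value)
        elif ann_type == "Gene":
            gene_counts[article_id] += 1

    return sorted(
        article_id
        for article_id, count in gene_counts.items()
        if count > 1 and len(species.get(article_id, ())) == 1
    )
```

If this returns an empty list on the full database as well, skipping the case seems reasonable; otherwise the returned IDs give concrete examples to inspect.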


Changes:


To run the one-time script annotation_dedup.py, make sure the appropriate credentials for the database you wish to update are entered in config.ini, then run python3 annotation_dedup.py. The script logs information about its progress as it runs. It can be stopped and resumed at any time without damaging any data in the database.
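For reference, reading credentials from config.ini could look roughly like the sketch below. It uses the standard-library configparser module; the section name CREDENTIALS, the keys uri, user and password, and the helper name load_credentials are assumptions for illustration and may not match the actual layout of config.ini in this repository.

```python
import configparser


def load_credentials(path="config.ini", section="CREDENTIALS"):
    """Read database credentials from an INI file.

    The section and key names are assumed for this sketch; adjust them
    to whatever config.ini really contains.
    """
    # interpolation=None so a '%' in a password is read literally
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"could not read {path}")

    values = parser[section]  # raises KeyError if the section is missing
    required = ("uri", "user", "password")
    missing = [key for key in required if key not in values]
    if missing:
        raise KeyError(f"missing keys in [{section}]: {', '.join(missing)}")

    return {key: values[key] for key in required}
```

Failing early on a missing file or key means a misconfigured run stops before any connection to the database is attempted.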


Separately, I'm wondering if we should organize the repository more cleanly. For example, annotation_dedup.py, initial_loading.py, and perhaps some future code (e.g. code to address #11) are strictly one-time scripts and do not really need to be maintained, i.e. we are only really keeping them for documentation purposes. I think we can create a scripts folder to store these scripts.

In contrast, annotations.py, update-neo4j.py, and perhaps some other files will be used on a regular basis and may need to be maintained/updated in the future, so they might also be grouped together?

(can discuss later, not essential for this PR)


TODO: