greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

Analyses that depend on path counts now filtered from the database #174

Closed · dhimmel closed this 3 years ago

dhimmel commented 3 years ago

In https://github.com/greenelab/hetmech/pull/173, I began touching up some of the rephetio epilepsy predictions for the manuscript. Specifically, I'm most interested in including visualizations from:

Those notebooks connected to a legacy database we had hosted on a Penn workstation that is either no longer online or firewalled. In https://github.com/greenelab/hetmech/commit/60f4826a951d007eeb19227e19669544e57eeb93, I switched over to using the production database.

However, the production database does not include all nonzero path counts, since it filters on a p-value threshold to save database storage. See "Prioritizing enriched metapaths for database storage" in the draft manuscript. I believe the notebooks above were developed by @ben-heil against a database that did not filter any rows from the path_counts table (a database populated before https://github.com/greenelab/connectivity-search-backend/pull/41). @ben-heil does that sound correct?

dhimmel commented 3 years ago

Solutions

1. API paths queries

The API can generate p-values for DWPCs (for metapaths up to length 3) that aren't stored in the database. For example, https://search-api.het.io/v1/paths/source/29/target/6215/metapath/CrCpD/?limit=0. However, getting all DWPCs for Compound-Epilepsy pairs across the 136 metapaths would require 211,072 API queries. This would take a bit over 24 hours on my slow internet connection. Not ideal.
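
For scale, a single pair looks like this. A minimal Python sketch; everything beyond the URL itself, including the shape of the JSON response, is an assumption to check against the live API rather than a documented contract:

import requests

# One Compound-Disease pair for one metapath; limit=0 asks the endpoint
# for summary statistics without enumerating individual paths.
url = "https://search-api.het.io/v1/paths/source/29/target/6215/metapath/CrCpD/"
response = requests.get(url, params={"limit": 0})
response.raise_for_status()
# Print the top-level keys rather than assuming specific field names.
print(sorted(response.json()))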

2. HetMatPy

We could use the hetmatpy module and download the HetMat archives to recalculate these values in this repository.
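
A rough sketch of that recalculation. The local archive path is a placeholder, and the hetmatpy call signatures below are from memory, so verify them against the hetmatpy source before relying on this:

from hetmatpy.hetmat import HetMat
from hetmatpy.degree_weight import dwpc

# Assumes a Hetionet HetMat archive has been downloaded and extracted
# to this local directory (path is hypothetical).
hetmat = HetMat("hetionet-v1.0.hetmat")

# Recompute DWPCs for one metapath instead of reading the filtered
# values from the database. damping=0.5 matches the project default.
metapath = hetmat.metagraph.metapath_from_abbrev("CrCpD")
rows, cols, dwpc_matrix = dwpc(hetmat, metapath, damping=0.5)
print(dwpc_matrix.shape)  # compounds x diseases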

3. Reload database

We could reload the PostgreSQL database without filtering any compound–disease path counts by p-value. This would also make the webapp more useful for Compound–Disease queries.

Worst case, this would add 27 million (136 × 1,500 × 136) rows to the path counts table. But the actual number is an order of magnitude smaller, because most of these pairs have zero paths.
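
The back-of-envelope arithmetic behind that estimate, spelled out (reading the three factors as metapaths × compounds × diseases is my interpretation of the figures above):

# 136 Compound-Disease metapaths x ~1,500 compounds x ~136 diseases
print(136 * 1_500 * 136)  # 27,744,000, i.e. ~27 million rows worst case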

So I'm leaning towards this solution, although regenerating the database is a bit cumbersome.

ben-heil commented 3 years ago

> Those notebooks connected to a legacy database we had hosted on a Penn workstation that is either no longer online or firewalled. In 60f4826, I switched over to using the production database.

Actually, both :) Penn put up a firewall so that you can only remote in if you're using their VPN, but I also had to plug a different computer into the one working ethernet port at your desk. If you need me to get anything off the workstation, let me know and I can schedule a time to go into the office.

> However, the production database does not include all nonzero path counts, since it filters on a p-value threshold to save database storage. See "Prioritizing enriched metapaths for database storage" in the draft manuscript. I believe the notebooks above were developed by @ben-heil against a database that did not filter any rows from the path_counts table (a database populated before greenelab/connectivity-search-backend#41). @ben-heil does that sound correct?

The notebooks I worked on for my rotation were definitely developed before April 2019, since the rotation ended around March 22. I'm not certain of the database timing, but if the reduced database wasn't released until April 2019, then all my notebooks would have been developed against the old version.

dhimmel commented 3 years ago

@dongbohu in https://github.com/greenelab/connectivity-search-backend/pull/79, I created a new version of the database that retains all non-zero path counts for Compound–Disease metapaths (option 3 above). I now have a 5.3 GB file, connectivity-search-pg_dump.sql.gz. I'd like to upload the final version of our database to Zenodo, but don't want to burn our Zenodo quota if this database doesn't end up becoming final.

Is there a GCS bucket I could upload to? Then you'd be able to reload the production search-db.het.io database from this SQL dump?

dongbohu commented 3 years ago

@dhimmel Sure. I will create a bucket on Google cloud. After you upload the dump file, I can load it into the production database. The backend will have to be down for 2-3 hours. Is that okay with you?

dhimmel commented 3 years ago

Will run:

gsutil cp connectivity-search-pg_dump.sql.gz gs://connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz
gsutil setmeta \
  -h "x-goog-meta-source-commit:50801395c58311b7e18c890922b2875c3a875c06" \
  -h "x-goog-meta-source-info:https://github.com/greenelab/connectivity-search-backend/pull/79" \
  gs://connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz

dongbohu commented 3 years ago

@dhimmel Let me know when the file uploading is done.

dhimmel commented 3 years ago

Upload is complete. I also added public read access via the GUI. Here are the relevant access details:

# GCS Browser URL
https://console.cloud.google.com/storage/browser/_details/connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz

# Public URL
https://storage.googleapis.com/connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz

# URI
gs://connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz

@dongbohu did you want to reload the database now?

dongbohu commented 3 years ago

I will do it tomorrow morning.

On Jan 20, 2021, at 5:31 PM, Daniel Himmelstein notifications@github.com wrote:

> Upload is complete. I also added public read access via the GUI. Here are the relevant access details:
>
> GCS Browser URL: https://console.cloud.google.com/storage/browser/_details/connectivity-search/db/2021-10-12/connectivity-search-pg_dump.sql.gz
>
> Public URL: https://storage.googleapis.com/connectivity-search/db/2021-10-12/connectivity-search-pg_dump.sql.gz
>
> URI: gs://connectivity-search/db/2021-10-12/connectivity-search-pg_dump.sql.gz
>
> @dongbohu did you want to reload the database now?


dongbohu commented 3 years ago

@dhimmel: Why is the dump file's path in the bucket 2021-10-12? Is it supposed to be 2021-01-12 instead?

dhimmel commented 3 years ago

@dongbohu good catch. I moved the file to have the correct date:

gsutil mv gs://connectivity-search/db/2021-10-12/connectivity-search-pg_dump.sql.gz gs://connectivity-search/db/2021-01-12/connectivity-search-pg_dump.sql.gz

dongbohu commented 3 years ago

@dhimmel The new database is up. Please check it when you get time.

dhimmel commented 3 years ago

> Please check it when you get time.

Looks good. I evaluated https://het.io/search/?source=167&target=18102&complete=. Note that some metapaths don't show as precomputed, but those all have zero paths.