ncats / pharos-graphql-server

Optimal query for literature references linking targets and ligands #94

Open eric-czech opened 1 year ago

eric-czech commented 1 year ago

I would like to know what publications associate targets and ligands such that the publications explicitly note some interaction/relationship between the pair (not just the target or just the ligand). The query in https://github.com/ncats/pharos-graphql-server/issues/93 seemed like a reasonable place to start. Is there a better way to do this?

I would also like to run this query infrequently (monthly or quarterly at most) and with no filter, i.e. I'd like to capture all ligand <-> target relationships with citations.

Any suggestions on the best way to accomplish this would be appreciated. Thanks!

KeithKelleher commented 1 year ago

That query looks good for fetching the publications we have that report each known target-ligand interaction. There are a couple of things to add.

  1. Add a field alias for drugs. There's an open issue to fix this, but unless you tell the API that you want drugs AND ligands (i.e. approved and unapproved compounds), it will only give you back the ligands.
  2. Add the field ligandCounts as a sanity check that the numbers of drugs and ligands you're getting back are consistent.

    ligandCounts {
      name
      value
    }
    ligands(isdrug: false) {
      ligid
      name
      description
      isdrug
      activities {
        pubs {
          pmid
          title
          year
        }
      }
    }
    drugs: ligands(isdrug: true) {
      ligid
      name
      description
      isdrug
      activities {
        pubs {
          pmid
          title
          year
        }
      }
    }

If you want to run this query for all targets, you'll probably have to paginate the results; otherwise it will be slow and return a very large response. It seems you're doing that already, so that's good. One optimization would be to filter your target list to Tchem and Tclin targets, since having a known chemical interaction is the main criterion for a target to no longer be considered Tdark or Tbio.

    "filter": {
      "facets": [
        {
          "facet": "Target Development Level",
          "values": ["Tclin", "Tchem"]
        }
      ]
    }

The other thing to consider is that the data in TCRD (and subsequently Pharos) is a subset of ligand activities that come primarily from DrugCentral and ChEMBL; activities below a threshold are not included. Here is the blurb on Pharos about the criteria for inclusion:

Activity Thresholds

Activity values from DrugCentral and ChEMBL must be standardizable to -Log Molar units AND meet the following target-family-specific cutoffs:

  - GPCRs: <= 100nM
  - Kinases: <= 30nM
  - Ion Channels: <= 10μM
  - Non-IDG Family Targets: <= 1μM
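
To make the cutoffs concrete, here is a minimal Python sketch that converts them to -log molar units and checks whether a single activity value would pass. The family labels used as dictionary keys and the function names are hypothetical, chosen for illustration only; they are not taken from TCRD or Pharos.

```python
import math

# Cutoffs from the Pharos blurb, expressed in molar units.
# The dictionary keys are illustrative labels, not official TCRD values.
CUTOFFS_MOLAR = {
    "GPCR": 100e-9,        # <= 100 nM
    "Kinase": 30e-9,       # <= 30 nM
    "Ion Channel": 10e-6,  # <= 10 uM
}
DEFAULT_CUTOFF_MOLAR = 1e-6  # non-IDG family targets: <= 1 uM


def to_neg_log_molar(molar: float) -> float:
    """Convert a molar concentration to -log10 molar units."""
    return -math.log10(molar)


def passes_threshold(family: str, activity_molar: float) -> bool:
    """Return True if the activity is at or below the family cutoff."""
    cutoff = CUTOFFS_MOLAR.get(family, DEFAULT_CUTOFF_MOLAR)
    return activity_molar <= cutoff


if __name__ == "__main__":
    for family, cutoff in CUTOFFS_MOLAR.items():
        print(family, round(to_neg_log_molar(cutoff), 2))
    # A 50 nM activity passes the GPCR cutoff but not the kinase cutoff.
    print(passes_threshold("GPCR", 50e-9), passes_threshold("Kinase", 50e-9))
```

In -log molar terms a more potent compound has a larger value, so "<= 100nM" corresponds to a value of at least 7.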

If you want data outside those criteria, you'd probably want to get the data straight from ChEMBL and DrugCentral.
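
As a rough illustration of the Tclin/Tchem filter suggested above, the following Python sketch builds a request payload that passes the facet filter as a GraphQL variable. The `filter` argument on `targets` and the `IFilter` type name are assumptions that should be checked against the schema, and `build_payload` is a hypothetical helper.

```python
import json

# Assumption: the outer `targets` field accepts a `filter` argument of a type
# named IFilter. Verify both against the Pharos schema before relying on this.
QUERY = """
query ($offset: Int!, $limit: Int!, $filter: IFilter) {
  targets(filter: $filter) {
    targets(skip: $offset, top: $limit) {
      name
      sym
      uniprot
    }
  }
}
"""


def build_payload(offset, limit, levels=("Tclin", "Tchem")):
    """Build the JSON body for one page of targets at the given levels."""
    return {
        "query": QUERY,
        "variables": {
            "offset": offset,
            "limit": limit,
            "filter": {
                "facets": [
                    {
                        "facet": "Target Development Level",
                        "values": list(levels),
                    }
                ]
            },
        },
    }


if __name__ == "__main__":
    # Print only the variables so the filter structure is easy to inspect.
    print(json.dumps(build_payload(0, 100)["variables"], indent=2))
```

The payload would be sent as the body of an HTTP POST to the GraphQL endpoint; the sketch leaves the request itself out so it runs without network access.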

eric-czech commented 1 year ago

Thanks again @KeithKelleher, that's extremely helpful! We'll try those improvements and you can close this if you'd like, otherwise I'll leave it open and report back for the sake of posterity (or if any other questions come up).

KeithKelleher commented 1 year ago

Glad to help. Yes, let us know how it goes, and if there's anything else.

Rahkovsky commented 1 year ago

@KeithKelleher, thank you very much for your advice. We ran the following query, looping over the offset and limit values.

    query query ($offset: Int!, $limit: Int!) {
      targets {
        targets(skip: $offset, top: $limit) {
          name
          sym
          uniprot
          facetValues(facetName: "Target Development Level")
          ligandCounts {
            name
            value
          }
          nonDrugLigands: ligands(isdrug: false, top: 10000) {
            ligid
            name
            description
            isdrug
            activities {
              pubs {
                pmid
                title
                year
              }
            }
          }
          DrugLigands: ligands(isdrug: true, top: 10000) {
            ligid
            name
            description
            isdrug
            activities {
              pubs {
                pmid
                title
                year
              }
            }
          }
        }
      }
    }
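
For anyone reproducing this, the offset/limit loop can be sketched in Python roughly as follows. This is not our exact script: `fetch_page` is a hypothetical callable that would post the query above with the given `offset` and `limit` and return the inner `targets` list, and `fake_fetch_page` is a stand-in so the example runs without network access.

```python
from typing import Callable, Iterator, List


def iter_targets(
    fetch_page: Callable[[int, int], List[dict]], limit: int = 100
) -> Iterator[dict]:
    """Yield targets page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break
        yield from page
        if len(page) < limit:
            # A short page means there is nothing left to fetch.
            break
        offset += limit


def fake_fetch_page(offset: int, limit: int) -> List[dict]:
    """Stand-in for the real API call: serves slices of 250 fake targets."""
    data = [{"sym": f"T{i}"} for i in range(250)]
    return data[offset:offset + limit]


all_targets = list(iter_targets(fake_fetch_page, limit=100))
print(len(all_targets))  # 250
```

Stopping on a short page saves one request when the total is not a multiple of `limit`; the empty-page check covers the case where it is.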

We found out that by default at most 10 ligands are returned per protein, so to override this we need to add a top parameter with a sufficiently large value:

    DrugLigands: ligands(isdrug: true, top: 10000)

The counts of unique proteins and unique protein-ligid combinations are almost identical. Curiously, we extract slightly more records from the DrugLigands + nonDrugLigands query than the validation counts report (screenshot: "Screenshot 2023-06-26 at 6 58 51 PM"). Do you know what the reason might be?
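
One way to narrow down a discrepancy like this is to compare, per target, the reported counts with the rows actually returned. The Python sketch below assumes each target is a dict shaped like the query response above, and that summing the `ligandCounts` values gives the expected total; that second assumption is not confirmed in this thread. `count_check` is a hypothetical helper, and the sample target is made-up data.

```python
def count_check(target: dict) -> dict:
    """Compare reported ligand counts with the rows returned for one target."""
    drugs = target.get("DrugLigands") or []
    non_drugs = target.get("nonDrugLigands") or []
    drug_ids = {lig["ligid"] for lig in drugs}
    non_drug_ids = {lig["ligid"] for lig in non_drugs}
    # Assumption: the ligandCounts values add up to the expected total.
    reported = sum(c["value"] for c in target.get("ligandCounts") or [])
    return {
        "reported": reported,
        "returned_rows": len(drugs) + len(non_drugs),
        "unique_ligids": len(drug_ids | non_drug_ids),
        "in_both_lists": len(drug_ids & non_drug_ids),
    }


# Made-up example: ligand "C" appears in both lists.
example = {
    "ligandCounts": [
        {"name": "ligand", "value": 2},
        {"name": "drug", "value": 1},
    ],
    "nonDrugLigands": [{"ligid": "A"}, {"ligid": "B"}, {"ligid": "C"}],
    "DrugLigands": [{"ligid": "C"}],
}
print(count_check(example))
```

If `returned_rows` exceeds `unique_ligids`, some ligids are repeated within or across the two lists, which would be one candidate explanation to rule in or out.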