mskcc / smile-server

2 stars 4 forks source link

Investigate and resolve cases where samples are linked to multiple patient nodes. #1269

Closed ao508 closed 4 days ago

ao508 commented 1 week ago

DONE CONDITIONS


Note that this is separate from #1267 which addresses patient nodes that exist in the database with shared CMO patient ID aliases.

Some added context:

These patient nodes may not have the same CMO patient ID. Assuming that the sample's latest metadata has the correct CMO patient ID, we need to detach the other edges pointing to the patient nodes that do not have that matching CMO patient ID.

Example case:

smile sample id: 9630f18a-45ae-485c-8f3c-722121186df7 primary id: 11704_U_3 latest import date: 2022-08-31 latest cmo patient id: C-6K6KW9 Linked patients in graph db

smile patient id cmo id
4f68a470-2e6f-4e28-91b0-6d5e9241476d C-6K6KW9 *** correct patient to link to based on latest metadata
1a322af9-9dc0-4ccf-a118-1b60f967b4d9 JC-ucc-014
b4275a18-dfdd-4288-9377-0d69b317a9d4 C-UELL8F

Neo4j Cypher query:

MATCH (n:PatientAlias) WHERE n.value="JC-ucc-051" OR n.value="C-HX2JLU" RETURN n LIMIT 25
CleanShot 120347.png

The two overlapping samples were the only WES samples out of this group of samples, and they were all imported on the same date (6/8/22).

CleanShot 120249.png

Some findings on the duplicate Patient nodes in the prod database:

  1. There are 269 unique Patient nodes that share at least a Sample with another Patient
  2. All these Patients have a CMO Patient ID, only three have a DMP Patient ID
  3. There are 184 unique Sample nodes that are shared between Patients
  4. The breakdown of import dates of the 184 samples:
importDate sampleCount
2022-09-22 99
2023-05-10 45
2022-07-13 16
2022-07-08 6
2023-05-08 4
2023-05-12 3
2023-05-22 3
2022-06-08 2
2023-04-12 2
2022-08-31 1
2022-09-23 1
2023-05-03 1
2022-10-28 1

Used queries, in the same order as above:

  1. MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 RETURN COUNT(p1) a. To preview the clusters of duplicate Patients: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN DISTINCT p1, p2, s LIMIT 5
  2. MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace = "cmoId" RETURN COUNT(DISTINCT p1) a. To view the list of CMO Patient IDs: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace="cmoId" RETURN pa.value, p1.smilePatientId
  3. MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN COUNT(DISTINCT s)
  4. MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 MATCH (s)-[:HAS_METADATA]->(sm:SampleMetadata) WITH s, MAX(sm.importDate) AS latestImportDate WITH latestImportDate, COUNT(DISTINCT s) AS sampleCount RETURN latestImportDate AS importDate, sampleCount ORDER BY sampleCount DESC
ao508 commented 1 week ago

Please add your planning poker estimate with Zenhub @qu8n

ao508 commented 1 week ago

Script added here: https://github.mskcc.org/cmo/smile-utils/tree/main/samples_multiple_patients