Investigate and resolve cases where samples are linked to multiple patient nodes.

ao508 commented 1 week ago

DONE CONDITIONS

Data is fixed (all affected samples only point to one patient node)
Script is checked into smile-utils repo for monitoring and catching future cases like this

Note that this is separate from #1267 which addresses patient nodes that exist in the database with shared CMO patient ID aliases.

Some added context:

These patient nodes may not have the same CMO patient ID. Assuming that the sample's latest metadata has the correct CMO patient ID, we need to detach the other edges pointing to the patient nodes that do not have that matching CMO patient ID.

Example case:

smile sample id: 9630f18a-45ae-485c-8f3c-722121186df7 primary id: 11704_U_3 latest import date: 2022-08-31 latest cmo patient id: C-6K6KW9 Linked patients in graph db

smile patient id	cmo id
4f68a470-2e6f-4e28-91b0-6d5e9241476d	C-6K6KW9	*** correct patient to link to based on latest metadata
1a322af9-9dc0-4ccf-a118-1b60f967b4d9	JC-ucc-014
b4275a18-dfdd-4288-9377-0d69b317a9d4	C-UELL8F

Neo4j Cypher query:

MATCH (n:PatientAlias) WHERE n.value="JC-ucc-051" OR n.value="C-HX2JLU" RETURN n LIMIT 25

The two overlapping samples were the only WES samples out of this group of samples, and they were all imported on the same date (6/8/22).

Some findings on the duplicate Patient nodes in the prod database:

There are 269 unique Patient nodes that share at least a Sample with another Patient
All these Patients have a CMO Patient ID, only three have a DMP Patient ID
There are 184 unique Sample nodes that are shared between Patients
The breakdown of import dates of the 184 samples:

importDate	sampleCount
2022-09-22	99
2023-05-10	45
2022-07-13	16
2022-07-08	6
2023-05-08	4
2023-05-12	3
2023-05-22	3
2022-06-08	2
2023-04-12	2
2022-08-31	1
2022-09-23	1
2023-05-03	1
2022-10-28	1

Used queries, in the same order as above:

MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 RETURN COUNT(p1) a. To preview the clusters of duplicate Patients: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN DISTINCT p1, p2, s LIMIT 5
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace = "cmoId" RETURN COUNT(DISTINCT p1) a. To view the list of CMO Patient IDs: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace="cmoId" RETURN pa.value, p1.smilePatientId
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN COUNT(DISTINCT s)
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 MATCH (s)-[:HAS_METADATA]->(sm:SampleMetadata) WITH s, MAX(sm.importDate) AS latestImportDate WITH latestImportDate, COUNT(DISTINCT s) AS sampleCount RETURN latestImportDate AS importDate, sampleCount ORDER BY sampleCount DESC

ao508 commented 1 week ago

Please add your planning poker estimate with Zenhub @qu8n

ao508 commented 1 week ago

Script added here: https://github.mskcc.org/cmo/smile-utils/tree/main/samples_multiple_patients

mskcc / smile-server

Investigate and resolve cases where samples are linked to multiple patient nodes. #1269