Data is fixed (all affected samples only point to one patient node)
Script is checked into smile-utils repo for monitoring and catching future cases like this
Note that this is separate from #1267 which addresses patient nodes that exist in the database with shared CMO patient ID aliases.
Some added context:
These patient nodes may not have the same CMO patient ID. Assuming that the sample's latest metadata has the correct CMO patient ID, we need to detach the other edges pointing to the patient nodes that do not have that matching CMO patient ID.
*** correct patient to link to based on latest metadata
1a322af9-9dc0-4ccf-a118-1b60f967b4d9
JC-ucc-014
b4275a18-dfdd-4288-9377-0d69b317a9d4
C-UELL8F
Neo4j Cypher query:
MATCH (n:PatientAlias) WHERE n.value="JC-ucc-051" OR n.value="C-HX2JLU" RETURN n LIMIT 25
The two overlapping samples were the only WES samples out of this group of samples, and they were all imported on the same date (6/8/22).
Some findings on the duplicate Patient nodes in the prod database:
There are 269 unique Patient nodes that share at least a Sample with another Patient
All these Patients have a CMO Patient ID, only three have a DMP Patient ID
There are 184 unique Sample nodes that are shared between Patients
The breakdown of import dates of the 184 samples:
importDate
sampleCount
2022-09-22
99
2023-05-10
45
2022-07-13
16
2022-07-08
6
2023-05-08
4
2023-05-12
3
2023-05-22
3
2022-06-08
2
2023-04-12
2
2022-08-31
1
2022-09-23
1
2023-05-03
1
2022-10-28
1
Used queries, in the same order as above:
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 RETURN COUNT(p1)
a. To preview the clusters of duplicate Patients: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN DISTINCT p1, p2, s LIMIT 5
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace = "cmoId" RETURN COUNT(DISTINCT p1)
a. To view the list of CMO Patient IDs: MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace="cmoId" RETURN pa.value, p1.smilePatientId
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN COUNT(DISTINCT s)
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 MATCH (s)-[:HAS_METADATA]->(sm:SampleMetadata) WITH s, MAX(sm.importDate) AS latestImportDate WITH latestImportDate, COUNT(DISTINCT s) AS sampleCount RETURN latestImportDate AS importDate, sampleCount ORDER BY sampleCount DESC
DONE CONDITIONS
smile-utils
repo for monitoring and catching future cases like thisNote that this is separate from #1267 which addresses patient nodes that exist in the database with shared CMO patient ID aliases.
Some added context:
These patient nodes may not have the same CMO patient ID. Assuming that the sample's latest metadata has the correct CMO patient ID, we need to detach the other edges pointing to the patient nodes that do not have that matching CMO patient ID.
Example case:
smile sample id: 9630f18a-45ae-485c-8f3c-722121186df7 primary id: 11704_U_3 latest import date: 2022-08-31 latest cmo patient id: C-6K6KW9 Linked patients in graph db
Neo4j Cypher query:
The two overlapping samples were the only WES samples out of this group of samples, and they were all imported on the same date (6/8/22).
Some findings on the duplicate Patient nodes in the prod database:
Used queries, in the same order as above:
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 RETURN COUNT(p1)
a. To preview the clusters of duplicate Patients:MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN DISTINCT p1, p2, s LIMIT 5
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace = "cmoId" RETURN COUNT(DISTINCT p1)
a. To view the list of CMO Patient IDs:MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 WITH DISTINCT p1 MATCH (pa:PatientAlias)-[:IS_ALIAS]->(p1) WHERE pa.namespace="cmoId" RETURN pa.value, p1.smilePatientId
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 RETURN COUNT(DISTINCT s)
MATCH (p1:Patient)-[:HAS_SAMPLE]->(s:Sample)<-[:HAS_SAMPLE]-(p2:Patient) WHERE p1 <> p2 MATCH (s)-[:HAS_METADATA]->(sm:SampleMetadata) WITH s, MAX(sm.importDate) AS latestImportDate WITH latestImportDate, COUNT(DISTINCT s) AS sampleCount RETURN latestImportDate AS importDate, sampleCount ORDER BY sampleCount DESC