Hi Ernesto,
I am including Toan Ong as the technical contact for Project 3. I can serve as technical contact for Projects 1 and 2. Please let us know specific questions you have about Project 3, which is the most detailed request.
Thanks,
Jason
From: David-Dimarino, Ernesto daviddim@med.usc.edu
Sent: Thursday, February 2, 2017 8:20 AM
To: Jason N. Doctor; Laura LaCorte
Cc: Renelle Davis
Subject: RE: OHMPI protocol
Jason,
I think the best way to proceed is for you to provide me with all the necessary steps and applications, and the technical contact that I will work with.
I will carve out some time in my calendar and run the entire process locally.
I will open a project in the Data Warehouse’s SR system. This project doesn’t have an OHMPI dependency and we can leverage existing data warehouse objects to support it.
I’d like to understand who “whoever responsible for performing the linkage” refers to. I am assuming that would be my team, but I wasn’t sure from the way this was stated.
Hope this sounds like a plan forward.
Ernesto
From: Jason N. Doctor [mailto:jdoctor@usc.edu]
Sent: Wednesday, February 1, 2017 8:18 PM
To: David-Dimarino, Ernesto daviddim@med.usc.edu; Laura LaCorte LLacorte@ooc.usc.edu
Cc: Renelle Davis renelled@healthpolicy.usc.edu
Subject: Re: OHMPI protocol
Hi Ernesto,
Thanks for your patience. We are studying 3 different projects (methods), so I’ve specified how the data are needed for each, by project number (see my responses below). Let me know if you have any additional questions.
Thanks for your help,
Jason
Please define how data are hashed.
For Project 1, we just need the identifier element hashed using SHA-256.
For Project 2, we need to use format-preserving encryption (which can use AES as a base).
For Project 3, the encryption needed is more complicated. It uses cURL. The current data encryption software is just a stand-alone Java jar file, so you need Java (version 1.7 or later) installed. You also need to install a PostgreSQL DBMS (version 9.1 or later), or SQL Server 2012 or later, and load the clear-text data into a table. The jar file, executed behind the firewall, will encrypt the data and store the encrypted values directly in the same database as the clear-text table. The encrypted data will then be exported and shared with whoever is responsible for performing the linkage. The party (honest broker) who performs the linkage will use the linked identifiers to generate the final data extract, which will not include direct PHI. The specific steps of the hashing process are:
Tokenize clear-text string into bi-grams
Hash bi-grams using SHA-512-based functions (each function uses a different 64-bit salt string)
Map the hash result into Bloom filters (bit-string)
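The three steps above can be sketched in Python roughly as follows. This is an illustration only, not the code inside the jar file: the filter length (1000 bits), the use of two hash functions, the example salt values, the lower-casing, and the names `bigrams` and `bloom_encode` are all assumptions made for the example.

```python
import hashlib

def bigrams(text):
    """Step 1: split a clear-text string into overlapping two-character tokens."""
    s = text.strip().lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bloom_encode(text, salts, num_bits=1000):
    """Steps 2 and 3: hash each bi-gram once per salt and set the matching bit.

    Each salt defines one SHA-512-based hash function (assumed construction:
    the salt is simply prepended to the bi-gram before hashing).
    """
    bits = [0] * num_bits
    for gram in bigrams(text):
        for salt in salts:
            digest = hashlib.sha512(salt + gram.encode("utf-8")).digest()
            position = int.from_bytes(digest, "big") % num_bits
            bits[position] = 1
    return "".join(map(str, bits))

# Hypothetical 64-bit (8-byte) salts; real salts would be random and kept secret.
SALTS = [bytes.fromhex("0123456789abcdef"), bytes.fromhex("fedcba9876543210")]

smith = bloom_encode("smith", SALTS)
smyth = bloom_encode("smyth", SALTS)
```

Because "smith" and "smyth" share the bi-grams "sm" and "th", their bit-strings share the bits set by those bi-grams, which is what allows approximate matching on the encoded values without access to the clear text.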
Is this a random seeded hash algorithm or a flat algorithm without a seed?
Projects 1 and 2 can use a random seed as long as both sites use the same seed. But a random seed is not necessary in either project, as further encryption of the data occurs once the file is received.
Project 3 uses a random seed.
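As a rough sketch of what the Project 1 hashing could look like with and without a seed: the function name `hash_identifier`, the whitespace stripping, and the choice to prepend the seed to the identifier are assumptions for illustration. Both sites would need to agree on every one of these details, since any difference changes every hash value.

```python
import hashlib

def hash_identifier(identifier, seed=b""):
    """Return the SHA-256 hex digest of an identifier, optionally seeded.

    With the default empty seed this is a plain, unseeded SHA-256 hash.
    """
    normalized = identifier.strip().encode("utf-8")
    return hashlib.sha256(seed + normalized).hexdigest()

# Hypothetical identifier and seed values.
unseeded = hash_identifier("12345678")
seeded_site_a = hash_identifier("12345678", seed=b"shared-seed")
seeded_site_b = hash_identifier("12345678", seed=b"shared-seed")
```

Here `seeded_site_a` and `seeded_site_b` are identical because both calls use the same seed, which is the condition that lets records hashed at two sites still be matched.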
Is there some reason you absolutely need the OHMPI ID (Research ID)?
We only need an ID to match, but it does not need to be the actual research ID. One suggestion (to make it easier) would be simply for you to use the format-preserving encryption (with a key that is kept secret) to create a corresponding ID that you can then share.
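To make the suggestion concrete, here is a toy sketch of format-preserving encryption on numeric IDs: a 10-digit ID goes in and a different 10-digit ID comes out, and only the key holder can reverse it. This is not a standard algorithm and is not AES-based. It is a simple Feistel construction with HMAC-SHA256 as the round function, written only to show the idea. The names `fpe_encrypt` and `fpe_decrypt`, the round count, and the key are assumptions, and a real deployment would use a vetted mode such as NIST FF1 from an established library.

```python
import hashlib
import hmac

ROUNDS = 10  # assumed round count for this toy example

def _check(digits):
    if len(digits) < 2 or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected a string of at least two decimal digits")

def _prf(key, round_no, half):
    """Keyed round function: HMAC-SHA256 over the round number and one half."""
    msg = bytes([round_no]) + half.encode("ascii")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")

def fpe_encrypt(key, digits):
    """Map a decimal string to another decimal string of the same length."""
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else v          # width of the half being replaced
        c = (int(a) + _prf(key, i, b)) % 10 ** m
        a, b = b, str(c).zfill(m)
    return a + b

def fpe_decrypt(key, digits):
    """Invert fpe_encrypt by running the rounds backwards."""
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _prf(key, i, a)) % 10 ** m
        a, b = str(c).zfill(m), a
    return a + b

secret_key = b"kept-secret-at-the-source-site"   # hypothetical key
shared_id = fpe_encrypt(secret_key, "0012345678")  # hypothetical research ID
```

The shared ID has the same length and character set as the original, so it fits any field that held the research ID, and the research ID itself never has to leave the site.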
Does this have to be done using OHMPI, or can I use another data source?
All we need is a gold standard, where the id has been vetted and can be confirmed to be correct in terms of matching records.
How do you want the 20,000 patients identified: random, timeline driven, or other?
It would be best for us to have both random and timeline driven.
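One way the two selections might be drawn, as a sketch only: "timeline driven" is interpreted here as the first n patients, in date order, registered within a date window, which is an assumption about what is meant. The record layout (`id`, `registered`), the function names, and the sample data are all made up for the example.

```python
import random
from datetime import date

def random_sample(patients, n, seed=None):
    """Draw up to n patients uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(patients, min(n, len(patients)))

def timeline_sample(patients, n, start, end):
    """Take the first n patients, in date order, registered within [start, end]."""
    in_window = [p for p in patients if start <= p["registered"] <= end]
    return sorted(in_window, key=lambda p: p["registered"])[:n]

# Made-up records: 100 patients registered on the first of a month in 2016.
patients = [{"id": i, "registered": date(2016, 1 + i % 12, 1)} for i in range(100)]

random_cohort = random_sample(patients, 20, seed=2017)
timeline_cohort = timeline_sample(patients, 20, date(2016, 3, 1), date(2016, 6, 30))
```

Fixing the seed makes the random draw reproducible, which helps if the same cohort has to be re-extracted later.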
The paper uses diagnosis information in addition to demographic data. Are you expecting diagnoses? OHMPI does not have that.
Additional diagnosis information can help us block the comparisons, which improves the algorithm's efficiency, but it is not absolutely necessary.
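For context, blocking means comparing only those record pairs that agree on some key, rather than every record against every other. A minimal sketch, assuming a diagnosis code is the blocking key; the field names (`id`, `dx`), the function name `candidate_pairs`, and the records are hypothetical:

```python
from collections import defaultdict

def candidate_pairs(records_a, records_b, key):
    """Yield (id_a, id_b) only for records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[rec[key]].append(rec)
    for rec_a in records_a:
        for rec_b in blocks.get(rec_a[key], []):
            yield rec_a["id"], rec_b["id"]

# Made-up example records from two sites.
site_a = [
    {"id": "a1", "dx": "E11"},
    {"id": "a2", "dx": "I10"},
    {"id": "a3", "dx": "E11"},
]
site_b = [
    {"id": "b1", "dx": "E11"},
    {"id": "b2", "dx": "J45"},
]

pairs = list(candidate_pairs(site_a, site_b, "dx"))
```

In this toy data, blocking leaves 2 candidate pairs to compare instead of all 3 x 2 = 6, and the saving grows quickly with the number of records.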
From: David-Dimarino, Ernesto daviddim@med.usc.edu
Sent: Friday, January 20, 2017 4:52 PM
To: Jason N. Doctor; Laura LaCorte
Cc: Renelle Davis
Subject: RE: OHMPI protocol
Jason,
I reviewed the paper and information below.
Can you provide me with your hash algorithm so I can review how complex it would be to implement?
What I would like to do is get more detailed information so I can identify how to resource this request.
Questions:
Is this a random seeded hash algorithm or a flat algorithm without a seed?
Is there some reason you absolutely need the OHMPI ID (Research ID)?
Does this have to be done using OHMPI, or can I use another data source?
How do you want the 20,000 patients identified: random, timeline driven, or other?
The paper uses diagnosis information in addition to demographic data. Are you expecting diagnoses? OHMPI does not have that.
Just to set your expectations properly, OHMPI does not have all the fields you attached to your IRB. A list of required fields would help me assess.
Thank you,
Ernesto
Ernesto David-DiMarino
Senior Director Data Management, Keck Medicine of USC
Clinical Research Informatics Services Director, SC CTSI
Keck Medicine of USC
University of Southern California
2011 Soto St. #1420
Los Angeles, California 90032
Office: 323 442 8758
Mobile: 619 933 4591
daviddim@med.usc.edu