pSCANNER / Distributed-Methods-PPRL

PPRL algorithms in distributed computation framework

Follow-up with Ernesto on specifications for OHMPI protocol #3

Open tara-knight opened 7 years ago

tara-knight commented 7 years ago

Hi Ernesto,

I am including Toan Ong as the technical contact for Project 3. I can serve as technical contact for Projects 1 and 2. Please let us know specific questions you have about Project 3, which is the most detailed request.

Thanks,

Jason


From: David-Dimarino, Ernesto daviddim@med.usc.edu Sent: Thursday, February 2, 2017 8:20 AM To: Jason N. Doctor; Laura LaCorte Cc: Renelle Davis Subject: RE: OHMPI protocol

Jason, I think the best way to proceed is for you to provide me with all the necessary steps and applications, and the technical contact that I will work with. I will carve out some time in my calendar and run the entire process locally. I will open a project in the Data Warehouse’s SR system. This project doesn’t have an OHMPI dependency and we can leverage existing data warehouse objects to support it. I’d like to understand who is meant by the party “responsible for performing the linkage”; I am assuming that would be my team, but I wasn’t sure from the way this was stated.

Hope this sounds like a plan forward.

Ernesto

From: Jason N. Doctor [mailto:jdoctor@usc.edu] Sent: Wednesday, February 1, 2017 8:18 PM To: David-Dimarino, Ernesto daviddim@med.usc.edu; Laura LaCorte LLacorte@ooc.usc.edu Cc: Renelle Davis renelled@healthpolicy.usc.edu Subject: Re: OHMPI protocol

Hi Ernesto,

Thanks for your patience. We have 3 different projects (methods) we are studying so I’ve specified how the data are needed for each by project number. (See my responses below). Let me know if you have any additional questions.

Thanks for your help,

Jason

  1. Please define how data are hashed.
    For Project 1, we just need the identifier element hashed using SHA-256. For Project 2, we need to use format-preserving encryption (which can use AES as a base). For Project 3, the encryption needed is more complicated. It uses cURL.

    The current data encryption software is a stand-alone Java jar file, so you need Java (version 1.7 or later) installed. You also need to install a PostgreSQL DBMS (version 9.1 or later), or SQL Server 2012 or later, and load the clear-text data into a table. The jar file, executed behind the firewall, will encrypt the data and store the encrypted values directly in the same database as the clear-text table. Encrypted data will then be exported and shared with whoever is responsible for performing the linkage. The party (honest broker) who performs the linkage will use the linked identifiers to generate the final data extract, which will not include direct PHI.

    The specific steps of the hashing process:
      1. Tokenize the clear-text string into bi-grams.
      2. Hash the bi-grams using SHA-512-based functions (each function uses a different 64-bit salt string).
      3. Map the hash results into Bloom filters (bit strings).

  2. Is this a random seeded hash algorithm or a flat algorithm without a seed? Projects 1 and 2 can use a random seed as long as both sites use the same seed. But a random seed is not necessary in either project, as further encryption of the data occurs once the file is received. Project 3 uses a random seed.

  3. Is there some reason you absolutely need the OHMPI ID (Research ID)? We only need an id to match, but it does not need to be the actual research ID. One suggestion (to make it easier) would be simply for you to use the format preserving encryption (with a key that is kept secret) to create a corresponding id that you can then share.

  4. Does this have to be done using OHMPI, or can I use another data source? All we need is a gold standard, where the ID has been vetted and can be confirmed to be correct in terms of matching records.

  5. How do you want the 20,000 patients identified: random, timeline driven, or other? It would be best for us to have both random and timeline driven.

  6. The paper uses diagnosis information in addition to demographic data. Are you expecting diagnosis data? OHMPI does not have that. Additional diagnosis information can help us block the comparisons to improve algorithm efficiency, but it is not absolutely necessary.
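
For illustration only, the Project 1 hash and the three Project 3 hashing steps described in answer 1 (bi-gram tokenization, salted SHA-512 hashing, mapping into a Bloom filter) could be sketched in Python as below. This is not the Java jar mentioned above. The function names, the filter length of 1000 bits, the use of three hash functions, the lower-casing step, the modulo mapping, and the way salts are derived from a shared seed are all assumptions made for the sketch.

```python
import hashlib
import random
from typing import List


def sha256_identifier(value: str, seed: str = "") -> str:
    """Project 1 style: SHA-256 hex digest of an identifier.

    The optional seed prefix is an assumption; both sites would need the same one.
    """
    return hashlib.sha256((seed + value).encode("utf-8")).hexdigest()


def salts_from_seed(seed: int, count: int) -> List[bytes]:
    """Derive `count` 64-bit salts from a shared seed (hypothetical scheme).

    random.Random is not cryptographically secure; it is used here only so
    that two sites with the same seed obtain the same salts.
    """
    rng = random.Random(seed)
    return [rng.getrandbits(64).to_bytes(8, "big") for _ in range(count)]


def bigrams(text: str) -> List[str]:
    """Step 1: tokenize a clear-text string into overlapping bi-grams."""
    cleaned = text.strip().lower()  # normalization is an assumption
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def bloom_encode(text: str, salts: List[bytes], num_bits: int = 1000) -> List[int]:
    """Steps 2 and 3: hash each bi-gram with salted SHA-512, set Bloom filter bits."""
    bits = [0] * num_bits
    for gram in bigrams(text):
        for salt in salts:
            digest = hashlib.sha512(salt + gram.encode("utf-8")).digest()
            # Map the 512-bit digest to a bit position (assumed mapping).
            position = int.from_bytes(digest, "big") % num_bits
            bits[position] = 1
    return bits


if __name__ == "__main__":
    salts = salts_from_seed(seed=2017, count=3)
    encoded = bloom_encode("Smith", salts)
    print(sum(encoded), "bits set out of", len(encoded))
```

Because the salts come from the seed, two sites that use the same seed would produce identical bit strings for identical clear-text values, which is the property answer 2 relies on.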


From: David-Dimarino, Ernesto daviddim@med.usc.edu Sent: Friday, January 20, 2017 4:52 PM To: Jason N. Doctor; Laura LaCorte Cc: Renelle Davis Subject: RE: OHMPI protocol

Jason, I reviewed the paper and information below. Can you provide me with your hash algorithm so I can review how complex it would be to implement? I would like to get more detailed information so I can identify how to resource this request.

Questions:

  1. Is this a random seeded hash algorithm or a flat algorithm without a seed?
  2. Is there some reason you absolutely need the OHMPI ID (Research ID)?
  3. Does this have to be done using OHMPI, or can I use another data source?
  4. How do you want the 20,000 patients identified: random, timeline driven, or other?
  5. The paper uses diagnosis information in addition to demographic data. Are you expecting diagnosis data? OHMPI does not have that.

Just to set your expectation properly, OHMPI does not have all the fields you attached to your IRB. A list of required fields would help me assess.

Thank you, Ernesto

Ernesto David-DiMarino Senior Director Data Management, Keck Medicine of USC Clinical Research Informatics Services Director, SC CTSI Keck Medicine of USC University of Southern California 2011 Soto St. #1420 Los Angeles, California 90032 Office: 323 442 8758 Mobile: 619 933 4591 daviddim@med.usc.edu

tara-knight commented 7 years ago

Call scheduled with Ernesto for 3/22 @ 2pm