Suggested use of pre-processed data #4

sdimi commented 3 years ago

Hi @mmcdermott,

Really awesome project and resource! I was wondering if there's any suggested use of the pre-processed data and splits. I see that some of the .pkl files are actually dictionaries. Do you recommend using pandas to pre-process them into a PyTorch DataLoader, or something else? Sorry, I can't seem to find any relevant code in the repo.

mmcdermott commented 3 years ago

Hi @sdimi,

The files should (I believe, but it has been a while) be dictionaries mapping dataset split keys (what we often call 'rotations' in our code) to the actual train, val, and test datasets associated with that split. Those datasets should not need any additional processing; they should be instances of the PatientDataset class in our code, which derives from the PyTorch Dataset class.
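As a rough way to check that structure before wiring anything into training, the sketch below loads one of the .pkl files and summarizes what each split key maps to. It is not code from the repo: the helper names `describe` and `summarize_splits` are hypothetical, and since it is an assumption whether each key maps to a tuple/list or to a dict of train, val, and test datasets, the helper handles both. Unpickling PatientDataset instances needs the repo's code to be importable, and pickle files should only be loaded from a trusted source.

```python
import pickle
from collections.abc import Mapping, Sized


def describe(obj):
    """Return a short description such as 'PatientDataset(len=120)'."""
    name = type(obj).__name__
    return f"{name}(len={len(obj)})" if isinstance(obj, Sized) else name


def summarize_splits(path):
    """Load a pickled dict of splits and describe what each key maps to.

    Hypothetical helper: the inner layout (tuple/list vs. dict of
    train/val/test datasets) is an assumption, so both are handled.
    """
    with open(path, "rb") as f:
        splits = pickle.load(f)
    if not isinstance(splits, Mapping):
        raise TypeError(f"expected a dict of splits, got {type(splits).__name__}")

    summary = {}
    for key, value in splits.items():
        if isinstance(value, Mapping):
            # e.g. {"train": ..., "val": ..., "test": ...}
            summary[key] = {k: describe(v) for k, v in value.items()}
        elif isinstance(value, (list, tuple)):
            # e.g. (train, val, test)
            summary[key] = [describe(v) for v in value]
        else:
            summary[key] = describe(value)
    return summary
```

Calling `summarize_splits` on the path to one of the downloaded files should show the split keys and the dataset types behind them. If those turn out to be PyTorch Dataset instances as described above, each one can be passed directly to `torch.utils.data.DataLoader` with a batch size of your choosing.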

If you put these pkl files on disk and point to them with the dataset args in accordance with the args.py file and the code in run_model.py (which is called a lot from our scripts), then it will automatically load them and use them in modelling.

This isn't very well documented yet, for which I apologize. If you have trouble after taking a look at the run_model.py code, I'm happy to help debug things via issues, email, or a video call; just let me know what you're running into.

sdimi commented 2 years ago

Thanks a lot for the pointers, @mmcdermott! Really appreciate your offer to help, I'll let you know :)

mmcdermott / comprehensive_MTL_EHR