orchid-initiative / synthetic-database-project

MIT License
4 stars 2 forks source link

Formatter factory #85

Closed TravisHaussler closed 7 months ago

TravisHaussler commented 9 months ago

Set up a formatting factory for multiple formatters, split each formatter into its own python file and split out common helpers into their own file

TravisHaussler commented 9 months ago

I have not fully implemented all of the elements we suggested in our review, but I got cracking on some of them while preserving the ability to run successfully.

  1. I do not have a clear understanding of the synth_data_module imports (i.e. the init.py setup) and ran into some conflicting dependencies. I got around this by separating out more functions, but it would be better for me to understand this so I could have seen this coming.

  2. I do not have a full understanding over arg passing as a parameter or whether they are somehow globally available via argparser. Or perhaps we are supposed to run argparser in multiple places. Not sure, but what I have currently works at least

TravisHaussler commented 8 months ago

Thanks Mason!

I did have quite some trouble getting the args to work, I don’t have a full understanding of how they are unpacked and passed, especially with subclassing using super.

Doc strings and logging is a great thing for me to work on improving my understanding and practice

You are ok with the general base classing and stuff, abstract methods, and synthea csv reads class?

Travis

On Fri, Dec 15, 2023 at 12:51 AM Mason Smith @.***> wrote:

@.**** commented on this pull request.

In synth_data_module/HCAI_inpatient_format.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427698146 :

@@ -0,0 +1,155 @@ +import datetime as dt +from synth_data_module import HCAIBase, get_procedure_list, get_diagnosis_list +import pandas as pd +import numpy as np +import synth_data_module.mappings as mappings +import time +from io import StringIO + +class HCAIInpatientFormat(HCAIBase):

  • def init(self, **kwargs):
  • super().init(**kwargs)

Here (and in HCAI_PDD_format.py), you want to be more explicit about the args that you want to take. You can still have the caller pass in **kwargs ,

In synth_data_module/HCAI_base_format.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427699884 :

@@ -0,0 +1,189 @@ +from abc import ABC, abstractmethod +from configparser import ConfigParser +from synth_data_module import Formatter, SyntheaOutput, modify_row +import datetime as dt +import pandas as pd +import numpy as np +import synth_data_module.mappings as mappings + + + +class HCAIBase(Formatter, ABC):

  • def init(self, **kwargs):
  • self.output_df = pd.DataFrame

Similar comment as for HCAI_PDD_format.py. You want something like:

def init(self, facility_id, **kwargs): self.facility_id = facility_id ....


In synth_data_module/HCAI_inpatient_format.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427701702 :

  • payers = self.synthea_output.payers_df()
  • payers['payer_id'] = payers.iloc[:, 0]
  • payers['Payer Category'] = payers.apply(mappings.payer_category, axis=1).astype(str)
  • encounters = encounters.merge(payers[['payer_id', 'Payer Category']], how='left', left_on='payer_id',
  • right_on='payer_id')
  • print('SUB-CHECK - Payers and codes merged. Encounters Shape: ', encounters.shape)
  • Merge the encounters dataframe into self.output_df, keeping only the fields we care about

  • self.output_df = self.output_df.merge(encounters[['encounter_id', 'Admission Date', 'Discharge Date',
  • 'Principal Diagnosis', 'Total Charges', 'Payer Category',
  • 'Facility Name']],
  • how='left', left_on='patient_id', right_on=encounters.iloc[:, 3])
  • print('Encounter info added. Shape: ', self.output_df.shape)
  • self.output_df = self.output_df.dropna(subset=['encounter_id']).reset_index(drop=True)
  • print('Patients with no encounters of desired type dropped. Shape: ', self.output_df.shape)
  • del encounters

this del encounters statement doesn't do anything.

In synth_data_module/formatting_factory.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427704089 :

@@ -0,0 +1,10 @@ +from synth_data_module import HCAIInpatientFormat, HCAIPDDFormat, Formatter +from io import StringIO + + +# Factory method +def create_formatter(**kwargs) -> Formatter: #, settings: argparse.Namespace

Similarly here, you want something like:

def create_formatter(format_type, kwargs) -> Formatter: if "HCAI_Inpatient" == format_type: return HCAIInpatientFormat(kwargs["facilityid"], kwargs) ....


In synth_data_module/formatting_helpers.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427705827 :

  • return diagnosis_list
  • +def get_morbidity_list():

  • morbidity_list = []
  • for i in range(1, 13):
  • morbidity = {'fieldid': f'ecm{i}', 'name': f'External Causes of Morbidity {i}', 'length': 8, 'justification': 'left'}
  • morbidity_list.append(morbidity)
  • present_on_admission = {'fieldid': f'epoa{i}', 'name': f'POA - External Causes of Morbidity {i}', 'length': 1,
  • 'justification': 'left'}
  • morbidity_list.append(present_on_admission)
  • return morbidity_list
  • +def modify_row(row, df_fields, new_fields):

This probably needs a more descriptive name, like add_fields_split_by_index or something.

In synth_data_module/formatting_helpers.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427708231 :

  • if isinstance(first, tuple) and isinstance(second, tuple):
  • for code, date, i in zip(first, second, range(1, 26)):
  • if i == 1:
  • if new_fields[0] == 'Diagnosis':
  • row[f'Principal {new_fields[0]}'] = code
  • row[f'{new_fields[1]} for Principal {new_fields[0]}'] = date
  • else:
  • row[f'Principal {new_fields[0]}'] = code
  • row[f'Principal {new_fields[1]}'] = date
  • else:
  • row[f'{new_fields[0]} {i}'] = code
  • row[f'{new_fields[1]} {i}'] = date
  • return row
  • +# HCAI has specific criteria for age reporting +def calculate_age(x, y, form="agedays"):

Comments at the function or class level should use triple-quote style comments https://realpython.com/python-comments-guide/#when-writing-code-for-others aka docstrings. Most editors understand how to display these when you hover over a function name, and they show up when you run help(func) in the interpreter, ipython, or jupyter.

In synth_data_module/formatting_helpers.py https://github.com/orchid-initiative/synthetic-database-project/pull/85#discussion_r1427710719 :

  • files2 = glob.glob(output_fhirs)
  • for f in files2:
  • os.remove(f)
  • files3 = glob.glob(output_metadata)
  • for f in files3:
  • os.remove(f)
  • +def parse_city(specify_city=None):

  • city = None
  • state = None
  • if specify_city:
  • if len(specify_city.split(",")) == 2:
  • city = str(specify_city.split(",")[0])
  • state = str(specify_city.split(",")[1])
  • print("Overriding default location Massachusetts to city, state: %s, %s " % (city, state))

There are a lot of places where you have semi-info, semi-debug statements like this. They're fine for development, but if you want to leave them in you probably want to use the logging https://docs.python.org/3/library/logging.html module more extensively. This will make it easier to turn these print statements on or off without going through all of the code.

— Reply to this email directly, view it on GitHub https://github.com/orchid-initiative/synthetic-database-project/pull/85#pullrequestreview-1783356404, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZL3CWINQDCWZIHVJRCVKQTYJQFRLAVCNFSM6AAAAABAIH3RWCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTOOBTGM2TMNBQGQ . You are receiving this because you authored the thread.Message ID: <orchid-initiative/synthetic-database-project/pull/85/review/1783356404@ github.com>

TravisHaussler commented 7 months ago

Lets get this closed to not let it hang