orchid-initiative / synthetic-database-project

MIT License

Resolve Fragmentation PerformanceWarnings in code #81

Closed (TravisHaussler closed this issue 1 year ago)

TravisHaussler commented 1 year ago

We are seeing a lot of these pandas warnings:

/home/travis/IdeaProjects/synthetic-database-project/synth_data_module/format.py:191: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  self.output_df['Type of Care'] = 1
/home/travis/IdeaProjects/synthetic-database-project/synth_data_module/format.py:192: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  self.output_df['Facility Identification Number'] = self.facility_id
/home/travis/IdeaProjects/synthetic-database-project/synth_data_module/format.py:193: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  self.output_df['Not in Use'] = '     '

Investigate and resolve

TravisHaussler commented 1 year ago

Instead of doing a call like:

self.output_df['Type of Care'] = 1

We do this instead:

# Build the constant column on the frame's own index so pd.concat aligns it row for row.
care_s = pd.Series(1, index=self.output_df.index, name='Type of Care')
self.output_df = pd.concat([self.output_df, care_s], axis=1)
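
Since the warning fires on three consecutive assignments, it may be cleaner to collect all the constant columns and attach them with a single pd.concat(axis=1), which is what the warning text itself suggests. Below is a minimal sketch, not the project's actual code: the helper name add_constant_columns, the "Patient ID" column, and the facility ID value "000000" are all hypothetical and used only for illustration.

```python
import pandas as pd


def add_constant_columns(df: pd.DataFrame, constants: dict) -> pd.DataFrame:
    """Return a new frame with every constant column appended in one concat.

    Hypothetical helper for illustration; not part of synth_data_module.
    """
    # Passing index=df.index broadcasts each scalar over the frame's rows,
    # so the new columns line up even when the index is not 0..n-1.
    new_cols = pd.DataFrame(constants, index=df.index)
    # One concat call instead of one column insertion per assignment.
    return pd.concat([df, new_cols], axis=1)


# Stand-in for self.output_df, with a non-default index on purpose.
output_df = pd.DataFrame({"Patient ID": [101, 102, 103]}, index=[10, 20, 30])

output_df = add_constant_columns(
    output_df,
    {
        "Type of Care": 1,
        "Facility Identification Number": "000000",  # placeholder value
        "Not in Use": "     ",
    },
)
print(output_df.columns.tolist())
```

Building the new columns on df.index matters here: a Series or DataFrame created with a default index would be aligned by label during the concat, which can introduce NaN rows if output_df has been filtered or re-indexed earlier.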