orchid-initiative / synthetic-database-project

MIT License

Understand the Timescale of our Code #98

Closed NickKramer87 closed 1 month ago

NickKramer87 commented 4 months ago

As a user of the Orchid Initiative code, I want it to run as quickly as possible so that I can generate as large a patient dataset as I need in the shortest possible time. To accomplish this, I first need to understand the timescales of the existing code so that I can identify potential bottlenecks.

Developer Guide: https://github.com/orchid-initiative/synthetic-database-project/blob/main/documentation/DEVELOPER_GUIDE.md

Action Items

  1. Brief summary of the time needed to run code segments.
  2. Plan to improve any significant bottlenecks.
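
One way to produce the summary in item 1 is to wrap each code segment in a small timer and rank the totals. The sketch below is a minimal illustration using only the standard library. The segment names (`generate_records`, `format_output`) and the helpers `timed` and `summarize` are hypothetical and are not taken from the Orchid Initiative code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named code segment.
segment_totals = defaultdict(float)


@contextmanager
def timed(segment):
    """Add the wall-clock time spent inside the block to segment_totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        segment_totals[segment] += time.perf_counter() - start


def summarize(totals):
    """Return (segment, seconds, share_of_total) rows, slowest segment first."""
    grand_total = sum(totals.values()) or 1.0
    rows = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, secs, secs / grand_total) for name, secs in rows]


if __name__ == "__main__":
    # Hypothetical stand-ins for real pipeline stages.
    with timed("generate_records"):
        time.sleep(0.05)
    with timed("format_output"):
        time.sleep(0.01)
    for name, secs, share in summarize(segment_totals):
        print(f"{name:20s} {secs:8.3f} s  {share:6.1%}")
```

The segments with the largest share of total time are the first candidates for item 2. For a finer-grained view, the standard library's `cProfile` module reports time per function without manual instrumentation.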

beckyphan commented 2 months ago

  1. Generating 1000 records currently takes around 30 min.
  2. Generating 5000 records failed with an error. I am revisiting this to troubleshoot the error before devising a plan to address the significant bottlenecks.

beckyphan commented 2 months ago

Progress Updates 6/21:

Travis Haussler: oh, btw, 5500 is too large a set to run. I am still working on how to expand the program's capabilities, so use something less than 2500 for now to have it finish. Also, for context, running -N 1000 takes about 15 min, and it's fairly linear until it just breaks somewhere between -N 1000 and -N 5000 (which, again, I am investigating).

Travis Haussler: I found a key area of date calculation that needs to be vectorized in order to run much more efficiently.
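
The thread does not say which date calculation is involved, so the following is only a generic sketch of what vectorizing a date calculation can look like. It assumes NumPy is available and uses a hypothetical example (discharge date = admission date + length of stay). The function names are illustrative and do not come from the project.

```python
from datetime import date, timedelta

import numpy as np


def discharge_dates_loop(admit_dates, stay_days):
    """Row-by-row baseline: one Python-level date addition per record."""
    return [d + timedelta(days=int(n)) for d, n in zip(admit_dates, stay_days)]


def discharge_dates_vectorized(admit_dates, stay_days):
    """Same calculation as a single array operation on day-resolution dates."""
    admits = np.asarray(admit_dates, dtype="datetime64[D]")
    stays = np.asarray(stay_days, dtype=np.int64).astype("timedelta64[D]")
    return admits + stays


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the project.
    admits = [date(2023, 1, 30), date(2024, 2, 28), date(2023, 12, 31)]
    stays = [3, 2, 1]
    print(discharge_dates_loop(admits, stays))
    print(discharge_dates_vectorized(admits, stays))
```

Both functions return the same dates. The vectorized version does the addition inside NumPy instead of the Python interpreter, which usually matters more as the record count grows.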