Understand the Timescale of our Code

NickKramer87 commented 6 months ago

As a user of the Orchid Initiative code, I want it to run as quickly as possible so that I can generate as large a patient dataset as I need in the shortest possible time. To accomplish this, I need to understand how long each part of the existing code takes to run so that I can identify potential bottlenecks.

Developer Guide: https://github.com/orchid-initiative/synthetic-database-project/blob/main/documentation/DEVELOPER_GUIDE.md

Action Items

Brief summary of the time needed to run code segments.
Plan to improve any significant bottlenecks.
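
One way to produce the per-segment summary is to wrap each stage in a small timer. The snippet below is a minimal sketch using only the standard library; the `timed` helper, the `report` function, and the segment labels are hypothetical placeholders, not names from the project's code.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: accumulates wall-clock seconds per named segment.
timings = {}

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed

def report(timings):
    """Return (label, seconds, share of total) rows, slowest first."""
    total = sum(timings.values()) or 1.0
    rows = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, secs, secs / total) for label, secs in rows]

# Placeholder workloads standing in for real pipeline stages (assumed names).
with timed("generate_patients"):
    sum(i * i for i in range(200_000))
with timed("format_output"):
    sum(range(10_000))

for label, secs, share in report(timings):
    print(f"{label:<20} {secs:8.4f} s  {share:6.1%}")
```

Sorting by elapsed time puts the most likely bottleneck at the top of the summary. For a finer-grained view, the standard-library `cProfile` module reports time per function without any code changes.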

beckyphan commented 5 months ago

Generating 1000 records currently takes around 30 min.
Hit an error when trying to generate 5000 records; revisiting to troubleshoot the error before devising a plan to improve significant bottlenecks.

beckyphan commented 5 months ago

Progress Updates 6/21:

Travis Haussler: Oh, btw, 5500 is too large a set to run. I am still working on how to expand the program's capabilities, so use something less than 2500 for now to have it finish. Also, for context, running -N 1000 takes about 15 min, and it's fairly linear until it just breaks somewhere between -N 1000 and -N 5000 (which, again, I am investigating).
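
If the runtime really is roughly linear in N, the figures quoted in this thread can be extrapolated to estimate other run sizes. The sketch below is a back-of-the-envelope calculation only; `estimate_minutes` is a hypothetical helper, and the thread reports two different baselines for 1000 records (about 15 min and about 30 min), so both are shown.

```python
def estimate_minutes(n_records, minutes_per_1000):
    """Linear extrapolation from a measured 1000-record baseline."""
    return n_records / 1000 * minutes_per_1000

# Baselines reported in this thread for -N 1000 (both are rough figures).
for baseline in (15, 30):
    estimate = estimate_minutes(2500, baseline)
    print(f"N=2500 at {baseline} min per 1000 records: about {estimate:.1f} min")
```

The estimate only applies below the point where runs currently fail, which the thread places somewhere between 1000 and 5000 records.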

Travis Haussler: I found a key area of date calculation that needs to be vectorized in order to run much more efficiently.
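
The thread does not say which date calculation is involved, so the following is only an illustrative sketch of what vectorizing one might look like. It assumes a hypothetical calculation (discharge date = admission date + length of stay) and uses NumPy; the function and variable names are invented for the example and are not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical inputs: admission dates in 2023 and lengths of stay in days.
admit = np.datetime64("2023-01-01") + rng.integers(0, 365, size=n).astype("timedelta64[D]")
los_days = rng.integers(1, 15, size=n)

def discharge_loop(admit, los_days):
    """Row-by-row version: one Python-level addition per record."""
    out = []
    for a, d in zip(admit, los_days):
        out.append(a + np.timedelta64(int(d), "D"))
    return np.array(out, dtype="datetime64[D]")

def discharge_vectorized(admit, los_days):
    """Vectorized version: a single array operation over all records."""
    return admit + los_days.astype("timedelta64[D]")

# Both versions should agree; the vectorized one avoids the Python loop.
assert np.array_equal(discharge_loop(admit, los_days),
                      discharge_vectorized(admit, los_days))
```

The same idea carries over to pandas, where adding a timedelta column to a datetime column operates on the whole column at once instead of calling a function per row.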

orchid-initiative/synthetic-database-project, issue #98