medic / test-data-generator

Tool to upload test data into CHT test instances.
GNU General Public License v3.0

Support writing separate designs simultaneously #17

Closed jkuester closed 2 weeks ago

jkuester commented 8 months ago

https://github.com/latin-panda/test-data-generator/pull/14 made it so that only one level of one design is processed at a given time. We start at the top-level parent of a design, write the docs for that, then move down and do the same, one by one, for each of the children in a depth-first traversal of the design tree. This sequential processing allows for processing many thousands of designs without running out of heap space (or just memory in general). But it is "single-threaded" and will take a lot longer to process a large number of designs than a multi-threaded approach would.
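As a rough illustration of that level-by-level, depth-first order, here is a minimal sketch. It is not the code from the PR: the `Design` and `Doc` shapes, the `amount` and `children` fields, and the `batches` helper are all assumptions made for the example. Only `getDoc` is a name that appears in this thread. The generator yields one batch of docs at a time, so a caller can write each batch and let it be garbage-collected before the next one is built.

```typescript
// Hypothetical shapes for illustration; only `getDoc` is named in this thread.
interface Doc {
  _id: string;
  type: string;
  parent?: string;
}

interface Design {
  amount: number;                        // how many docs to create per parent
  getDoc: (parent?: Doc) => { type: string };
  children?: Design[];
}

let nextId = 0;

// Yields one level of one design at a time, walking the design tree depth-first.
function* batches(design: Design, parent?: Doc): Generator<Doc[]> {
  const docs: Doc[] = [];
  for (let i = 0; i < design.amount; i++) {
    docs.push({ ...design.getDoc(parent), _id: `doc-${nextId++}`, parent: parent?._id });
  }
  // The caller writes this batch before any child docs are generated.
  yield docs;

  // Then descend: for each doc just created, process each child design in turn.
  for (const doc of docs) {
    for (const child of design.children ?? []) {
      yield* batches(child, doc);
    }
  }
}

const root: Design = {
  amount: 2,
  getDoc: () => ({ type: 'place' }),
  children: [{ amount: 3, getDoc: () => ({ type: 'person' }) }],
};

for (const batch of batches(root)) {
  // A real run would upload `batch` here instead of logging it.
  console.log(batch.length, batch[0].type);
}
```

Only one batch is alive at a time, which is what keeps the heap small, but every batch also waits for the previous one, which is why a large run is slow.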

@latin-panda proposed that we use worker threads to allow for simultaneous processing of designs! Even just supporting 2-4 additional threads could likely have a significant impact on how fast designs could be processed (just have to make sure the server does not start rate-limiting)!

jkuester commented 8 months ago

For the record, now that my designs are taking multiple hours to execute, I am back to looking for a solution to this! :sweat_smile: Unfortunately, it does not look like the worker threads are going to be very easy to leverage. I tried pulling in https://www.npmjs.com/package/workerpool, but what I ran into was that you cannot pass functions as parameters to the new worker threads. This is a problem since our designs, which are what we would want to pass into the worker threads, contain the getDoc function. It seems like serialization of parameters is a more-or-less fundamental limitation of worker threads (and not just workerpool). The "solution" would be to make each design file a "worker", but that does not seem to get us anywhere, since we would need to be able to split out the whole interaction between the design file and the generator code into a separate thread...
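To make the serialization limitation concrete: Node.js `worker_threads` copy messages and `workerData` with the structured clone algorithm, which rejects functions. The sketch below uses the global `structuredClone` (Node.js 17+) to run the same check without spawning a worker. The `design` object and the `canCrossThreadBoundary` helper are made up for the example; only `getDoc` comes from this thread.

```typescript
// Hypothetical design object; only `getDoc` is named in this thread.
const design = {
  amount: 10,
  getDoc: () => ({ type: 'person' }),
};

// Returns true if `value` survives the structured clone algorithm,
// the same copying step applied to data posted to a worker thread.
function canCrossThreadBoundary(value: unknown): boolean {
  try {
    structuredClone(value);
    return true;
  } catch {
    // Functions cause a DataCloneError.
    return false;
  }
}

console.log(canCrossThreadBoundary(design)); // false
console.log(canCrossThreadBoundary({ amount: 10, designPath: './my-design.js' })); // true
```

Plain data such as a file path does cross the boundary, so a worker would have to load the design file itself. That is the "make each design file a worker" direction mentioned above, with the drawbacks already described.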

So, for the moment I am brainstorming other, non-thread-based, options here that might speed up writing without blowing up the heap.

jkuester commented 2 weeks ago

Closing this in favor of https://github.com/medic/test-data-generator/issues/21.

After more testing, it is clear that the single-threaded processing of designs is not the primary bottleneck. Instead, CouchDB view indexing and network throughput are almost certain to remain the limiting factors (regardless of how many threads we use to process the designs).