gordonkoehn / LolliPop

Deconvolution for Wastewater Genomics
GNU General Public License v3.0

Ensure Ease of Integration into V-Pipe, and Merge #4

Open gordonkoehn opened 16 hours ago

gordonkoehn commented 16 hours ago

Check for the ease of running this in the current V-Pipe.

Prepare and submit this as a good PR.

gordonkoehn commented 16 hours ago

First thoughts: The core loops are:

| Key | Iterations | Runtime per iter | Memory | Potential Reduction |
| --- | --- | --- | --- | --- |
| main | 1 | est. 4.6 h | 1414 MB pd.df_tally | ??? |
| location | 8 | 35 min for 100 b.s. | ~186 MB (df_tally[location]) | 1/8 |
| bootstraps (b.s.) | 100 (minimum) - 1000 (optimum) | 21 s total at 100 iters | ~190 MB (df_tally[location] resampled) | dep. on available cores, say 1/5 - 1/10 |
| date_intervals | (#dates - 1) = 12 | 0.5-6 s | 7.2872 MB (df_tally[location][resampled][date_interval]) | startup overhead 1/2 ? |
| #dates | ~13 (one sample per week) | XXX | XXX | |

That gives an estimated total runtime of about 4.6 h for 8 cities, assuming each behaves like Zürich (measured on an M1 Pro chip).
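As a quick sanity check on these numbers, the arithmetic can be written out. This is only a sketch using the figures quoted in this comment (8 locations, 35 min per location at 100 bootstraps); the variable names are mine.

```python
# Figures taken from the comment above; names are illustrative only.
n_locations = 8
minutes_per_location = 35   # measured for Zürich at 100 bootstraps
n_bootstraps = 100

# Time per single bootstrap iteration, in seconds.
seconds_per_bootstrap = minutes_per_location * 60 / n_bootstraps

# Serial runtime over all locations, in hours.
total_hours = n_locations * minutes_per_location / 60

print(seconds_per_bootstrap)   # 21.0
print(round(total_hours, 2))   # 4.67
```

So the roughly 4.6 h is simply 8 × 35 min, and the 21 s figure is the per-bootstrap share of one location's 35 min.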

These levels seem to be independent at first sight.

_I could imagine that ll.KernelDeconv could be parallelized; perhaps the easiest would be the innermost loop over the date_intervals._

In general, to use Python's multiprocessing the following must be true:

See Copilot:

Using the KernelDeconv class and its deconv method with multiprocessing should generally work, but there are a few considerations to keep in mind:

Thread Safety: Ensure that the objects and methods used within KernelDeconv are thread-safe. This includes the kernel, regressor, and confidence interval objects.

Data Sharing: When using multiprocessing, data is typically copied to each process. If the data is large, this can be inefficient. Consider using shared memory or other techniques to manage large datasets.

Pickleability: Objects passed to multiprocessing must be pickleable. Ensure that all objects and methods used in KernelDeconv can be serialized with pickle.
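The pickleability point is easy to test directly. Below is a minimal sketch using only the standard library; `is_picklable` is a helper name I made up, and the objects passed to it here are stand-ins. In practice one would pass the actual kernel, regressor and confidence interval objects used with `KernelDeconv`.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Plain data structures pickle fine.
print(is_picklable({"bandwidth": 10}))

# Lambdas cannot be pickled by the standard pickle module, so a
# lambda stored on an object would break multiprocessing.
print(is_picklable(lambda x: x))
```

If any of the objects fails this check, it would need to be constructed inside the worker process instead of being passed to it.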

To Check:

gordonkoehn commented 14 hours ago

V-Pipe currently runs LolliPop with the resources: threads: 1, memory: 1024 MB, disk_mb: 1024.

See config_schema.json and rule deconvolution.

At first this memory limit puzzled me: with my current date range, the full df_tally (about 1414 MB) would not fit into 1024 MB, so it should not run as is. It does, though, because V-Pipe seems to run LolliPop already stratified per location.

gordonkoehn commented 13 hours ago

Conclusion: Multiprocessing at the level of location and at the level of bootstraps both seem feasible and reasonable. At the level of date_intervals there is probably too much overhead for the short iteration duration. Multiprocessing at the level of bootstraps would additionally speed up single-location runs.

Parallelizing at the bootstrap level would also allow us to fix the number of cores.

The memory would stay within the bounds of a normal 1 GB job, and we would split the single job into about ten jobs at most in either case, which is a reasonable investment of resources.
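A rough sketch of what bootstrap-level multiprocessing with a fixed number of workers could look like. This is not LolliPop's code: `resample`, `fit_one_bootstrap` and `run_bootstraps` are hypothetical names, the toy list stands in for `df_tally[location]`, and the mean is only a placeholder for the actual deconvolution fit.

```python
import random
from multiprocessing import Pool

N_BOOTSTRAPS = 100  # lower bound mentioned in this thread
N_WORKERS = 10      # assumed fixed core count


def resample(rows, seed):
    """Draw len(rows) rows with replacement, reproducibly per seed."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


def fit_one_bootstrap(task):
    """Hypothetical stand-in for one deconvolution on a resampled tally."""
    rows, seed = task
    sample = resample(rows, seed)
    return sum(sample) / len(sample)  # placeholder statistic, not the real model


def run_bootstraps(rows, n_bootstraps=N_BOOTSTRAPS, n_workers=N_WORKERS):
    """Distribute the bootstrap iterations over a fixed-size process pool."""
    # Each task carries its own copy of rows, which is pickled per task.
    tasks = [(rows, seed) for seed in range(n_bootstraps)]
    with Pool(processes=n_workers) as pool:
        return pool.map(fit_one_bootstrap, tasks)


if __name__ == "__main__":
    toy_rows = [0.1, 0.4, 0.2, 0.9, 0.5]  # stand-in for df_tally[location]
    estimates = run_bootstraps(toy_rows)
    print(len(estimates), min(estimates), max(estimates))
```

Seeding each bootstrap by its index keeps the results reproducible regardless of how the pool schedules the tasks. The per-task copy of the data is the "Data Sharing" cost mentioned above; at ~190 MB per resampled location this may matter and would need measuring.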

gordonkoehn commented 11 hours ago

Integration in Snakemake: this should be no problem; we would just need to add a flag to the rule to allow threads=10.