sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
24 stars 6 forks source link

vcf2zarr progress bars aren't updated on Mac #201

Closed tomwhite closed 3 months ago

tomwhite commented 3 months ago

For example:

vcf2zarr convert sample.vcf.gz sample.zarr -p 4
    Scan:   0%|                                       | 0.00/1.00 [00:00<?, ?files/s]
 Explode:   0%|                                        | 0.00/9.00 [00:00<?, ?vars/s]
  Encode:   0%|                                            | 0.00/927 [00:00<?, ?B/s]
Finalise: 100%|███████████████████████████████| 21.0/21.0 [00:00<00:00, 1.64karray/s]

Running on a bigger VCF is the same: the first three bars are never updated.

This is using Python 3.10 on a Mac M1 (13.5.2) running under Rosetta.

jeromekelleher commented 3 months ago

Great - this is a dupe of #175 then

jeromekelleher commented 3 months ago

Seem likely the problem is to do with the shared memory approach taken here. Any insights into what's going on in Mac-land here would be much appreciated!

tomwhite commented 3 months ago

Mac uses spawn by default, while Linux uses fork - see https://docs.python.org/3/library/multiprocessing.html#multiprocessing-start-methods.

When I change to use fork, it works:

diff --git a/bio2zarr/core.py b/bio2zarr/core.py
index bd31656..730d85f 100644
--- a/bio2zarr/core.py
+++ b/bio2zarr/core.py
@@ -166,6 +166,7 @@ class ParallelWorkManager(contextlib.AbstractContextManager):
         else:
             self.executor = cf.ProcessPoolExecutor(
                 max_workers=worker_processes,
+                mp_context=multiprocessing.get_context("fork")
             )
         self.futures = []

However, it's not clear how safe this is, see https://stackoverflow.com/a/70552892:

fork might be OK as long as you don't use threads

Another suggestion is to use https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Manager.

Finally, in Python 3.14 the default will change away from fork for all platforms.

tomwhite commented 3 months ago

Another way to do this would be to update progress as results are returned from results_as_completed. This would allow any of the various ways to start a process (fork, spawn etc).

Would that work?

jeromekelleher commented 3 months ago

That was super helpful, thanks @tomwhite. I ended up using the shared memory multiprocessing.Value after trying out a lot of different options, and it seemed to the best balance of robustness and responsiveness. I agree the architecture isn't great, coupling over a global variable, but doing progress well is hard.

I switched everything over to "spawn", and got it working again, so should be the same on all platforms now.

jeromekelleher commented 3 months ago

Would you mind trying out the latest code to see if the progress works OK please @tomwhite?

tomwhite commented 3 months ago

I just tried it and it works great! Thanks for fixing it @jeromekelleher

BTW I do get a UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown but the progress is working now.

jeromekelleher commented 3 months ago

BTW I do get a UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown but the progress is working now

Good to know - can you open an issue to track this one please?

jeromekelleher commented 3 months ago

Any other usability type-things you see on mac would be great to know about btw.