libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Add option to seed numba RNG #89

Open juliustao opened 2 years ago

juliustao commented 2 years ago

Calling numpy.random.seed at the start of a script sets the global seed for all numpy.random methods. However, calling it from Python does not seed numba's generator, so numba JIT-compiled code is nondeterministic (e.g., the cutout_square() function returned by generate_code() in the Cutout operation). See the numba documentation for further details. Adding an optional seed argument to such random transforms would allow reproducibility across runs.
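
A minimal standalone sketch (not ffcv code) of the split between the two generators: seeding NumPy from interpreted Python leaves numba's per-thread RNG untouched.

import numpy as np
from numba import njit

@njit
def jit_rand():
    # Uses numba's own per-thread generator, not NumPy's global RandomState
    return np.random.random()

np.random.seed(0)          # seeds only NumPy's global generator
print(np.random.random())  # reproducible across runs
print(jit_rand())          # not reproducible: numba's RNG was never seeded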

GuillaumeLeclerc commented 2 years ago

Hi! I'm not sure doing this as part of the transforms is optimal: every transform with randomness would have to add it, and users would have to set the seed many times. Moreover, it would be run at each batch, which will slow things down. Maybe there is a better place. Any idea how long the random state in numba lives? Is it per thread or per process?

juliustao commented 2 years ago

Thanks for the quick response Guillaume!

I agree that this is not a great way to seed the random state in numba. I wasn't sure how to modify the Operation parent class so that np.random.seed(seed) could be called in an arbitrary function returned by generate_code.

A nicer solution would be to seed the numba random state once for all future JIT-compiled functions.

I'm not too familiar with numba, but the documentation linked above says that

Since version 0.28.0, the generator is thread-safe and fork-safe. Each thread and each process will produce independent streams of random numbers.

This numba thread suggests that it's possible to set the numba random state once at the start for determinism in single-threaded code. I'll dig into this more and run some tests once the Slurm cluster is back up.
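
For reference, the pattern discussed in that thread looks roughly like this (a sketch; the helper name is made up): calling np.random.seed inside nopython-compiled code seeds numba's generator for the calling thread.

import numpy as np
from numba import njit

@njit
def seed_numba(seed):
    # Inside nopython code, np.random.seed seeds numba's RNG (not NumPy's),
    # and only for the thread that calls it.
    np.random.seed(seed)

seed_numba(0)  # call once at startup, before any random JIT-compiled code runs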

Hope that clarified some questions :)

GuillaumeLeclerc commented 2 years ago

I think the best approach would be to do it at the start of loading and use the seed argument. Doing it in the operations means you get the same random sequence at each batch, which will most likely produce adverse results during training.

juliustao commented 2 years ago

I was able to get determinism after using training.num_workers = 1 with

import numpy as np
import torch

torch.backends.cudnn.deterministic = True  # avoid nondeterministic cuDNN kernels
torch.manual_seed(SEED)
np.random.seed(SEED)

in train_cifar.py and setting the numba seed inside the EpochIterator thread.

I cannot set the numba seed in train_cifar.py like numpy or torch since every thread has an independent numba state. This solution is still suboptimal, and I hope there's a simple fix that I overlooked.
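
A small standalone demonstration (assumed; not ffcv code) of the per-thread state: a seed set from the main thread does not reach a worker thread's numba generator.

import threading
import numpy as np
from numba import njit

@njit
def numba_seed(seed):
    np.random.seed(seed)  # affects only the calling thread's numba state

@njit
def numba_rand():
    return np.random.random()

numba_seed(0)
print("main:", numba_rand())    # reproducible across runs

def worker():
    # This thread never seeded its own numba state, so its draw varies per run.
    print("worker:", numba_rand())

t = threading.Thread(target=worker)
t.start()
t.join()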

juliustao commented 2 years ago

Also, I'm confused about the threading in ffcv: why is the EpochIterator object returned by iter(Loader()) implemented as a Thread? Is it because waiting for the CUDA stream takes significant time during which we can perform other CPU operations?

GuillaumeLeclerc commented 2 years ago

I suspect it would work with multiple workers too, since workers are only active in the body of the transforms. Have you tried that?

EpochIterator is implemented as a thread so that the augmentations (especially the CPU ones) do not block the user's main training loop.
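
To make that design concrete, here is a minimal sketch of the idea (assumed, not ffcv's actual implementation; make_batch and step stand in for the augmentation pipeline and the training step): a background thread prepares batches while the main loop consumes them.

import queue
import threading

def producer(make_batch, out_q, num_batches):
    # Runs in a background thread: CPU-side augmentation happens here
    for i in range(num_batches):
        out_q.put(make_batch(i))
    out_q.put(None)  # sentinel marking the end of the epoch

def train_loop(make_batch, num_batches, step):
    batches = queue.Queue(maxsize=2)
    threading.Thread(target=producer,
                     args=(make_batch, batches, num_batches),
                     daemon=True).start()
    while (batch := batches.get()) is not None:
        step(batch)  # training overlaps with preparation of the next batch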

juliustao commented 2 years ago

With the code above, setting training.num_workers > 1 does not give deterministic results :(

I haven't figured out why with confidence, but I suspect the cause is numba threads interleaving randomly.

GuillaumeLeclerc commented 2 years ago

Oh, that's a good call in my opinion. Not sure yet how to get around that problem. Do you personally have use cases where determinism is needed? Determinism usually doesn't play well with high-performance code (cuDNN deterministic mode can be significantly slower too).

juliustao commented 2 years ago

My current work looks at how fixing different sources of randomness affects training outcomes, and data augmentation is one example. Maybe this use case is rather niche and the changes are not worth the performance hit. Hopefully this thread can at least help others with similar issues :)

juliustao commented 2 years ago

On a related note, is the desired default behavior of the Random TraversalOrder to have the same shuffle order across independent runs? The default is self.seed = self.loader.seed = 0, which implies the above since the seed for each epoch is always self.seed + epoch.
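
For context, the behavior described above amounts to something like this sketch (illustrative, not the actual TraversalOrder code): with the default base seed of 0, a given epoch shuffles identically across independent runs.

import numpy as np

def epoch_permutation(num_samples, base_seed, epoch):
    # Seeding with base_seed + epoch varies the order across epochs,
    # but identically so across runs when base_seed is fixed.
    rng = np.random.default_rng(base_seed + epoch)
    return rng.permutation(num_samples)

order = epoch_permutation(10, base_seed=0, epoch=3)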

GuillaumeLeclerc commented 2 years ago

Hello,

It was intended, but we decided to change this in v0.0.3. The release candidate is already available on pip. See the announcements for more details.

heitorrapela commented 1 year ago

Did you succeed in running ffcv deterministically, @juliustao? I am facing a similar problem, with a gap of more than 5 points in my metric between runs of the same code. I seed everything, and I looked for a way to seed the workers as in PyTorch DataLoaders, but could not find one.
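
For comparison, the PyTorch DataLoader worker-seeding pattern referred to here looks roughly like this (a sketch following PyTorch's documented reproducibility recipe; the dataset is a placeholder):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Derive per-worker seeds from the base seed PyTorch hands each worker
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

dataset = TensorDataset(torch.arange(100))  # placeholder dataset
loader = DataLoader(dataset, batch_size=16, num_workers=4,
                    worker_init_fn=seed_worker, generator=g)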