graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0
85 stars 85 forks source link

Update DataModule #694

Closed RasmusOrsoe closed 2 months ago

RasmusOrsoe commented 2 months ago

This PR revisits the bug documented in #683 which I thought was previously fixed in #686.

Because Polars uses optimization strategies when executing code, it will behave slightly differently under the hood depending on the data. In some cases it will use multithreading while in other cases it won't. #683 is caused by thread lock from polars multithreaded operations, and I thought in #686 that I had removed the line causing the issue, but when revisiting this on a different dataset, I found that the issue was now caused from other lines of polars code.

Issues with polars and "forking" multiprocessing context is mentioned here.

I have found what appears to be a definitive solution to the bug. By simply specifying

dataloader = DataLoader(...,
                        multiprocessing_context = "spawn",
                        ..)

when creating dataloaders with the ParquetDataset, the threadlocking disappears.

This PR addresses this specific requirement to the ParquetDataset by checking for the argument and providing (hopefully) helpful logger messages (see screenshot). image

I also took the liberty of having test_dataloader_kwargs and the val_dataloader_kwargs default to train_dataloader_kwargs in cases where there is a test/val dataloader but no arguments for their respective dataloaders.

RasmusOrsoe commented 2 months ago

@ArturoLlorente thank you very much for pointing out these two errors! I have added a few commits that addresses them, and appended a few checks in the unit test of the DataModule to verify that the shuffle argument is set as expected.