Closed RasmusOrsoe closed 2 months ago
@ArturoLlorente thank you very much for pointing out these two errors! I have added a few commits that addresses them, and appended a few checks in the unit test of the DataModule to verify that the shuffle argument is set as expected.
This PR revisits the bug documented in #683 which I thought was previously fixed in #686.
Because Polars uses optimization strategies when executing code, it will behave slightly differently under the hood depending on the data. In some cases it will use multithreading while in other cases it won't. #683 is caused by thread lock from polars multithreaded operations, and I thought in #686 that I had removed the line causing the issue, but when revisiting this on a different dataset, I found that the issue was now caused from other lines of polars code.
Issues with polars and "forking" multiprocessing context is mentioned here.
I have found what appears to be a definitive solution to the bug. By simply specifying
when creating dataloaders with the
ParquetDataset
, the threadlocking disappears.This PR addresses this specific requirement to the![image](https://github.com/graphnet-team/graphnet/assets/48880272/bf6916bd-b007-4903-b3ab-9c3d8c3243cf)
ParquetDataset
by checking for the argument and providing (hopefully) helpful logger messages (see screenshot).I also took the liberty of having
test_dataloader_kwargs
and theval_dataloader_kwargs
default totrain_dataloader_kwargs
in cases where there is a test/val dataloader but no arguments for their respective dataloaders.