[MNT] windows compatibility

fkiraly commented 2 weeks ago

Tests currently fail on windows (windows-latest)

all python versions: libuv issues, see https://github.com/jdb78/pytorch-forecasting/pull/1622. We should (a) check whether this is CI-specific or a deeper compatibility issue, and (b) fix it.
python 3.10-3.12, separate issue: https://github.com/jdb78/pytorch-forecasting/issues/1632

benHeid commented 2 weeks ago

The libuv issues seem to have been introduced by torch 2.4.0. From the PyTorch tutorial:

Recently, we have rolled out a new TCPStore server backend using libuv, a third-party library for asynchronous I/O. This new server backend aims to address scalability and robustness challenges in large-scale distributed training jobs, such as those with more than 1024 ranks. We ran a series of benchmarks to compare the libuv backend against the old one, and the experiment results demonstrated significant improvements in store initialization time and maintained a comparable performance in store I/O operations.

As a result of these findings, the libuv backend has been set as the default TCPStore server backend in PyTorch 2.4. This change is expected to enhance the performance and scalability of distributed training jobs.

Source: https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html

Let me try to figure out how to configure it correctly with libuv...

benHeid commented 2 weeks ago

So according to the tutorial, it should be possible to switch libuv off in one of these ways:

Set use_libuv=False when creating dist.TCPStore. Not applicable, since we do not create the store directly.
Set init_method=f"tcp://{addr}:{port}?use_libuv=0" in dist.init_process_group. Unfortunately, we have no direct control over this, since the call happens inside PyTorch Lightning.
Set os.environ["USE_LIBUV"] = "0". I would rather not do something like that... :/
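For illustration, the second and third options could be sketched as below. This is only a rough sketch: the helper names `tcp_init_method` and `disable_libuv_on_windows` are made up for this example and are not part of PyTorch or PyTorch Lightning. Only the `use_libuv=0` query option and the `USE_LIBUV` environment variable come from the tutorial quoted above.

```python
import os
import sys


def tcp_init_method(addr: str, port: int, use_libuv: bool = False) -> str:
    """Build a tcp:// init_method URL carrying the use_libuv query option.

    Hypothetical helper: the resulting string is what one would pass as
    init_method to dist.init_process_group if one called it directly.
    """
    return f"tcp://{addr}:{port}?use_libuv={int(use_libuv)}"


def disable_libuv_on_windows(platform: str = sys.platform) -> bool:
    """Set USE_LIBUV=0 on Windows only; return True if the variable was set.

    Hypothetical helper: it would need to run before the process group
    (and hence the TCPStore) is created, e.g. at the top of a test module.
    """
    if platform.startswith("win"):
        os.environ["USE_LIBUV"] = "0"
        return True
    return False


if __name__ == "__main__":
    print(tcp_init_method("127.0.0.1", 29500))
    print(disable_libuv_on_windows())
```

Restricting the environment variable to Windows would keep the libuv backend in use on the platforms where it works, at the cost of a platform-specific branch in the test setup.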

Other options would be to not test with the DDP strategy, or to downgrade PyTorch. Unfortunately, I have no Windows system right now, so I cannot produce a minimal example that we could use to open an issue at pytorch-lightning asking them to expose the relevant parameters.

fkiraly commented 2 weeks ago

I do have a Windows system. Can you be specific about what we'd need: just an MRE for the failure, or something more specific?

benHeid commented 2 weeks ago

Could you check whether this fails with PyTorch 2.4.0?

```python
import pytorch_lightning as pl
import numpy as np
import torch
from torch.nn import MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn


class SimpleDataset(Dataset):
    def __init__(self):
        X = np.arange(10000)
        y = X * 2
        X = [[_] for _ in X]
        y = [[_] for _ in y]
        self.X = torch.Tensor(X)
        self.y = torch.Tensor(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {"X": self.X[idx], "y": self.y[idx]}


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)
        self.criterion = MSELoss()

    def forward(self, inputs_id, labels=None):
        outputs = self.fc(inputs_id)
        loss = 0
        if labels is not None:
            loss = self.criterion(outputs, labels)
        return loss, outputs

    def train_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def training_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = Adam(self.parameters())
        return optimizer


if __name__ == '__main__':
    model = MyModel()
    trainer = pl.Trainer(max_epochs=1, accelerator="cpu", strategy="ddp")
    trainer.fit(model)
    X = torch.Tensor([[1.0], [51.0], [89.0]])
    _, y = model(X)
    print(y)
```

Hopefully this reproduces the issue with the ddp strategy.

fkiraly commented 2 weeks ago

I can reproduce the error on windows 11, torch 2.4.0, python 3.10, same failure, last lines of traceback:

    return TCPStore(
RuntimeError: use_libuv was requested but PyTorch was build without libuv support

benHeid commented 2 weeks ago

OK, I would propose opening an issue at PyTorch Lightning. And perhaps we should either remove the ddp strategy from testing, at least on Windows, or set the environment variable so that the old store is used instead of the libuv one.

fkiraly commented 2 weeks ago

I've added a skip here https://github.com/jdb78/pytorch-forecasting/pull/1631, but haven't closed the issue, as the skip of course does not causally solve this...
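As a rough illustration of what excluding DDP on Windows can look like (this is not the code from the linked PR; the helper name `strategies_to_test` and the choice of `"auto"` as the fallback strategy are assumptions made for this sketch):

```python
import sys

# True on Windows interpreters, where sys.platform is "win32".
IS_WINDOWS = sys.platform.startswith("win")


def strategies_to_test(is_windows: bool = IS_WINDOWS) -> list:
    """Return the Trainer strategies a test should be parametrized over.

    Hypothetical helper: "ddp" is dropped on Windows because creating the
    TCPStore fails there with torch 2.4.0 (see the traceback above).
    """
    strategies = ["auto"]
    if not is_windows:
        strategies.append("ddp")
    return strategies


if __name__ == "__main__":
    print(strategies_to_test())
```

A test could then loop or parametrize over `strategies_to_test()`. Like the skip, this only hides the failure on Windows and does not address its cause.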

jdb78 / pytorch-forecasting

[MNT] windows compatibility #1623