Open · liangz1 opened this issue 4 years ago
Same issue
As a quick fix, since I can't quite pinpoint where the error comes from (I tried resetting things manually, step by step, by swapping some variables and resetting the ventilator / reader): the error doesn't seem to come from the reader itself, since I even tried completely replacing the dataloader's reader variable manually with make_batch_reader.
So for now, I save the Torch dataset manager and generate a new dataloader every time I want to reset things:
manager = converter.make_torch_dataloader(...)
for x in range(3):
    print("epoch", x)
    # Re-enter the manager to get a fresh dataloader for this epoch
    train_data_loader = manager.__enter__()
    i = iter(train_data_loader)
    for batch_number in range(n_batches):
        pd_batch = next(i)
    # Tear the dataloader down before starting the next epoch
    manager.__exit__(None, None, None)
Or, this seems to work better, as it exits the dataloader cleanly and avoids errors:
manager = converter.make_torch_dataloader(...)
for x in range(3):
    with manager as train_data_loader:
        print("epoch", x)
        i = iter(train_data_loader)
        for batch_number in range(n_batches):
            pd_batch = next(i)
Note that in both snippets you'll need to pass num_epochs=1 to the make_torch_dataloader call.
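For reference, here is a fuller, self-contained sketch of the second workaround with the num_epochs=1 argument spelled out. The SparkSession setup, cache directory, toy DataFrame, and batch_size are illustrative assumptions, not part of the original report:

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Assumed setup: a local SparkSession, a cache dir, and a toy DataFrame
spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")
df = spark.range(100).toDF("value")
converter = make_spark_converter(df)

# num_epochs=1 means each entry into the manager yields exactly one pass
# over the data, so re-entering it restarts iteration for the next epoch.
manager = converter.make_torch_dataloader(batch_size=32, num_epochs=1)
for epoch in range(3):
    with manager as train_data_loader:
        print("epoch", epoch)
        for pd_batch in train_data_loader:
            pass  # training step goes here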
When using conv.make_torch_dataloader(num_epochs=1) as dataloader, the dataloader should support multiple calls of enumerate(dataloader). The following code snippet defines the expected behavior.

Expected behavior:
Actual behavior:
I ran this on petastorm==0.9.0, with both pyspark==2.4.5 on my laptop and pyspark==3.0.0. The outputs are identical in both runs.
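Since the snippets from the original report are not reproduced above, here is a minimal sketch of the double-enumeration pattern it describes; conv is a SparkDatasetConverter as in the earlier sketch, and the prints are illustrative:

with conv.make_torch_dataloader(num_epochs=1) as dataloader:
    # First pass: expected to yield every batch exactly once
    for i, batch in enumerate(dataloader):
        print("first pass, batch", i)
    # Second pass: expected to yield the same batches again,
    # but the report says this does not happen on petastorm==0.9.0
    for i, batch in enumerate(dataloader):
        print("second pass, batch", i)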