@david-waterworth thanks for reporting! Yes, it makes sense that we can't reuse the training iterator twice. If we want to check model generalization, i.e. compute metrics on both validation and training data, we can create 3 datapipes: 1) a training-only datapipe, 2) a validation datapipe, and 3) another training datapipe used for evaluation only, filtered to roughly the same number of samples as the validation datapipe (if possible). We can then construct 3 dataloaders (like here) and I think we should be able to get rid of the datapipe limitation... What do you think?
@vfdev-5 yes, that works. I wasn't sure at first how to implement it using datapipes, but I noticed that you can request the same split multiple times, i.e.
train_datapipe, train_val_datapipe, test_datapipe = DATASETS[args.dataset](root=args.data_dir, split=('train', 'train', 'test'))
For completeness, the code that constructs the datapipe is below; the _wrap_split_argument decorator is what enables the function to be called with a tuple of split names.
import os
from typing import Tuple, Union

from torchdata.datapipes.iter import FileOpener, IterableWrapper
# _create_dataset_directory and _wrap_split_argument are torchtext-internal helpers
from torchtext.data.datasets_utils import _create_dataset_directory, _wrap_split_argument

def _filepath_fn(root, split, _=None):
    return os.path.join(root, split + ".csv")

def _parse_fields(t):
    # keep only the text and label columns from each CSV row
    return dict(text=t[1].strip(), label=t[2])

@_create_dataset_directory(dataset_name="mydataset")
@_wrap_split_argument(("train", "test"))
def datapipe(root: str, split: Union[Tuple[str], str]):
    """
    Args:
        root: Directory where the datasets are saved. Default: os.path.expanduser('~/.torchtext/cache')
        split: split or splits to be returned. Can be a string or tuple of strings. Default: (`train`, `test`)
    """
    filepath_dp = IterableWrapper([_filepath_fn(root, split)])
    data_dp = (
        FileOpener(filepath_dp, encoding="utf-8")
        .parse_csv(skip_lines=1)
        .map(fn=_parse_fields)
        .shuffle()
        .sharding_filter()
    )
    return data_dp
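
To make the three-dataloader idea concrete, here is a minimal sketch of how the datapipes returned above could be wired up with ignite. The train_step/eval_step functions are hypothetical placeholders for the real model code, and the sketch assumes that train.csv and test.csv exist under root:

from torch.utils.data import DataLoader
from ignite.engine import Engine, Events

# request the train split twice: once for training, once for evaluation on training data
train_dp, train_eval_dp, test_dp = datapipe(root="data", split=("train", "train", "test"))

# shuffle=True activates the .shuffle() step already present in the datapipe graph
train_loader = DataLoader(train_dp, batch_size=32, shuffle=True)
train_eval_loader = DataLoader(train_eval_dp, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dp, batch_size=32, shuffle=False)

def train_step(engine, batch):
    # placeholder: real code would run forward/backward/optimizer steps here
    return 0.0

def eval_step(engine, batch):
    # placeholder: real code would return (y_pred, y) for the attached metrics
    return batch

trainer = Engine(train_step)
train_evaluator = Engine(eval_step)
test_evaluator = Engine(eval_step)

@trainer.on(Events.EPOCH_COMPLETED)
def run_evaluation(engine):
    # each evaluator has its own dataloader, so the training iterator is never reused
    train_evaluator.run(train_eval_loader)
    test_evaluator.run(test_loader)

trainer.run(train_loader, max_epochs=2)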
I need to look closer at the ignite code though. I would have assumed that the end-of-epoch event is fired outside the training iteration, in which case I'm not sure why the iterator isn't reset.
In general, resetting iterators should be done manually. Here is a how-to guide covering the majority of cases with iterators.
Let me close this issue as solved; feel free to reopen if something is still unclear.
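
For example, one way to reset an iterator manually is to swap in a fresh one at the start of each epoch. This is only a sketch (it assumes a train_loader like the one above, and epoch_length=100 is a hypothetical number of batches per epoch), not necessarily the exact recipe from the guide:

from ignite.engine import Engine, Events

def train_step(engine, batch):
    # placeholder training step
    return batch

trainer = Engine(train_step)

@trainer.on(Events.EPOCH_STARTED)
def use_fresh_iterator(engine):
    # build a new iterator each epoch instead of reusing an exhausted/invalidated one
    engine.set_data(iter(train_loader))

trainer.run(iter(train_loader), epoch_length=100, max_epochs=5)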
❓ Questions/Help/Support
I'm having issues with the quickstart. The issue seems to be that I'm using torchdata.datapipes to construct my dataloaders, so train_datapipe and test_datapipe are of type IterDataPipe. The problem is that this results in the following error:
Engine run is terminating due to exception: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterDataPipeSerializationWrapper() This may be caused multiple references to the same IterDataPipe. We recommend using .fork() if that is necessary. For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45

The problem appears to be that the evaluator is running on the training dataset as part of the training loop, which I assume results in the training iterator being accessed twice, which isn't supported (https://github.com/pytorch/data/issues/45).
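
As a side note, the .fork() suggested by the error message creates independent child datapipes from a single source, at the cost of buffering elements that one branch has consumed but the other has not. A tiny sketch:

from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper(range(10))
# two children that can each be iterated independently
branch_a, branch_b = source_dp.fork(num_instances=2)

print(list(branch_a))  # [0, 1, 2, ..., 9]
print(list(branch_b))  # [0, 1, 2, ..., 9]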
I've worked around this by only performing evaluation on the validation dataset, using the pattern from the footnote (https://pytorch.org/ignite/quickstart.html#id1), but I thought I should point out that, of the two supported patterns, only the one that is discouraged actually appears to work with the new torchdata.datapipes.