pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

[DataPipe] Ensure all DataPipes Meet Testing Requirements #106

Open · NivekT opened this issue 3 years ago

NivekT commented 3 years ago

🚀 Feature

We have many tests for existing DataPipes (both in PyTorch Core and TorchData). However, over time, they have become less organized. Moreover, as the testing requirements expand, older DataPipes may not have tests to cover the newly added requirements.

This issue aims to track the status of tests for all DataPipes.

Motivation

We want to ensure that test coverage for all DataPipes is complete, to reduce bugs and unexpected behavior.

Alternative

We should also create testing templates for IterDataPipe and MapDataPipe that can be widely applied.
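As a rough illustration of such a template (this sketch is mine, not from the issue), the shared checks could live in a mixin that concrete test cases parameterize with a factory and the expected output. The names `dp_factory` and `expected_items` are hypothetical, and the serialization check assumes a plain `pickle` round trip:

```python
import pickle
import unittest

from torchdata.datapipes.iter import IterableWrapper


class IterDataPipeTestTemplate:
    """Hypothetical reusable checks for an IterDataPipe.

    Concrete test cases set `dp_factory` (a zero-arg callable that builds
    a fresh DataPipe) and `expected_items` (the items it should yield).
    """

    dp_factory = None
    expected_items = None

    def test_functional(self):
        # Functional: the DataPipe yields the expected items.
        self.assertEqual(list(self.dp_factory()), self.expected_items)

    def test_reset(self):
        # Reset: a second pass over the same instance yields the same items.
        dp = self.dp_factory()
        self.assertEqual(list(dp), list(dp))

    def test_len(self):
        # __len__: implemented whenever possible; raising TypeError
        # (i.e. explicitly not implemented) is also acceptable.
        dp = self.dp_factory()
        try:
            self.assertEqual(len(dp), len(self.expected_items))
        except TypeError:
            pass

    def test_serializable(self):
        # Serializable: a pickle round trip preserves behavior.
        dp = pickle.loads(pickle.dumps(self.dp_factory()))
        self.assertEqual(list(dp), self.expected_items)


class TestIterableWrapper(IterDataPipeTestTemplate, unittest.TestCase):
    # Example: applying the template to IterableWrapper.
    dp_factory = staticmethod(lambda: IterableWrapper([1, 2, 3]))
    expected_items = [1, 2, 3]
```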

IterDataPipe Tracker

X - Done, NA - Not Applicable, Blank - Not Done/Unclear

Test definitions:

- Functional - unit test to ensure that the DataPipe works properly with various input arguments
- Reset - the DataPipe can be reset/restarted after being read
- `__len__` - the `__len__` method is implemented whenever possible (or explicitly not implemented)
- Serializable - the DataPipe is serializable
- Graph (future) - can be traversed as part of a DataPipe graph
- Snapshot (future) - can be saved/loaded as a checkpoint/snapshot

| Name | Module | Functional Test | Reset | `__len__` | Serializable (Picklable) | Graph | Snapshot |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Batcher | Core | X | X | X | X | | |
| Collator | Core | X | X | X | X | | |
| Concater | Core | X | X | X | X | | |
| Demultiplexer | Core | X | X | X | X | | |
| FileLister | Core | X | X | X | X | | |
| FileOpener | Core | X | X | X | X | | |
| Filter | Core | X | X | X | X | | |
| Forker | Core | X | X | X | X | | |
| Grouper | Core | X | X | | X | | |
| IterableWrapper | Core | X | X | X | X | | |
| Mapper | Core | X | X | X | X | | |
| Multiplexer | Core | X | X | X | X | | |
| RoutedDecoder | Core | X | X | X | X | | |
| Sampler | Core | X | X | X | X | | |
| Shuffler | Core | X | X | X | X | | |
| StreamReader | Core | X | X | X | X | | |
| UnBatcher | Core | X | X | | X | | |
| Zipper | Core | X | X | X | X | | |
| BucketBatcher | Data | X | X | X | X | | |
| CSVDictParser | Data | X | X | X | X | | |
| CSVParser | Data | X | X | X | X | | |
| Cycler | Data | X | X | X | X | | |
| DataFrameMaker | Data | X | X | X | X | | |
| Decompressor | Data | X | X | X | X | | |
| Enumerator | Data | X | X | X | X | | |
| FlatMapper | Data | X | X | X | X | | |
| FSSpecFileLister | Data | X | X | X | X | | |
| FSSpecFileOpener | Data | X | X | X | X | | |
| FSSpecSaver | Data | X | X | X | X | | |
| GDriveReader | Data | X | X | X | X | | |
| HashChecker | Data | X | X | X | X | | |
| Header | Data | X | X | X | X | | |
| HttpReader | Data | X | X | X | X | | |
| InMemoryCacheHolder | Data | X | X | X | X | | |
| IndexAdder | Data | X | X | X | X | | |
| IoPathFileLister | Data | X | X | X | X | | |
| IoPathFileOpener | Data | X | X | X | X | | |
| IoPathSaver | Data | X | X | X | X | | |
| IterKeyZipper | Data | X | X | X | X | | |
| JsonParser | Data | X | X | X | X | | |
| LineReader | Data | X | X | X | X | | |
| MapKeyZipper | Data | X | X | X | X | | |
| OnDiskCacheHolder | Data | X | X | X | X | | |
| OnlineReader | Data | X | X | X | X | | |
| ParagraphAggregator | Data | X | X | X | X | | |
| ParquetDataFrameLoader | Data | X | X | X | X | | |
| RarArchiveLoader | Data | X | X | X | X | | |
| Rows2Columnar | Data | X | X | X | X | | |
| SampleMultiplexer | Data | X | X | X | X | | |
| Saver | Data | X | X | X | X | | |
| TarArchiveLoader | Data | X | X | X | X | | |
| UnZipper | Data | X | X | X | X | | |
| XzFileLoader | Data | X | X | X | X | | |
| ZipArchiveLoader | Data | X | X | X | X | | |

MapDataPipe Tracker

X - Done, NA - Not Applicable, Blank - Not Done/Unclear

| Name | Module | Functional Test | `__len__` | Serializable (Picklable) | Graph | Snapshot |
| --- | --- | --- | --- | --- | --- | --- |
| Batcher | Core | X | X | | | |
| Concater | Core | X | X | | | |
| Mapper | Core | X | X | X | | |
| SequenceWrapper | Core | X | X | X | | |
| Shuffler | Core | X | X | | | |
| Zipper | Core | X | X | | | |

cc: @ejguan @VitalyFedyunin @NivekT

ejguan commented 3 years ago

This is awesome. One nit: serializable should be the same as picklable, IMO.

ejguan commented 2 years ago

@NivekT I am concerned about when and how we want to do graph testing. For a single DataPipe instance, graph testing makes no sense, so we would have to construct a DataPipe graph ourselves. The problem then is how we can guarantee test coverage for all use cases.

VitalyFedyunin commented 2 years ago

We can require each DataPipe to provide a simple example usage graph for this purpose.
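To sketch what that might look like (my example, not from the thread): each DataPipe could ship a small example graph, and a shared test could traverse it with `torch.utils.data.graph.traverse` and assert that the DataPipe under test appears in the result. `make_example_graph` is a hypothetical helper, and the `{datapipe: subgraph}` return format assumed here has varied across PyTorch versions:

```python
from torch.utils.data.graph import traverse
from torchdata.datapipes.iter import IterableWrapper


def make_example_graph():
    # Hypothetical "simple example usage graph" for Mapper:
    # source -> shuffle -> map
    return IterableWrapper(range(10)).shuffle().map(lambda x: x * 2)


def collect_datapipe_names(graph):
    # Assuming traverse() returns a nested {datapipe: subgraph} dict,
    # flatten it into the set of DataPipe class names it contains.
    names = set()
    for dp, subgraph in graph.items():
        names.add(type(dp).__name__)
        names |= collect_datapipe_names(subgraph)
    return names


def test_mapper_is_traversable():
    names = collect_datapipe_names(traverse(make_example_graph()))
    assert "MapperIterDataPipe" in names
```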

ejguan commented 2 years ago

When we have time, we might need to go over our DataPipes again to identify any missing tests, since a few DataPipes were implemented recently.

Besides, for future reference, we might want to evolve our testing framework into something similar to OpInfo in PyTorch Core, so that test coverage runs automatically without us going over each test by hand.

NivekT commented 2 years ago

> When we have time, we might need to go over our DataPipes again to identify any missing tests, since a few DataPipes were implemented recently.
>
> Besides, for future reference, we might want to evolve our testing framework into something similar to OpInfo in PyTorch Core, so that test coverage runs automatically without us going over each test by hand.

Agreed that the OpInfo-like approach is probably best. I think our inputs and the necessary setup for each test are a bit all over the place. Having tests split between two repos doesn't help either.
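For illustration of the OpInfo-like direction (again, my sketch, not an agreed design): each DataPipe would be registered once with a factory and its expected output, and the shared checks would be parametrized over that registry. `DataPipeInfo` and `DATAPIPE_INFOS` are hypothetical names:

```python
import pickle
from dataclasses import dataclass
from typing import Any, Callable, List

import pytest
from torchdata.datapipes.iter import IterableWrapper


# Module-level functions (not lambdas) so the resulting DataPipes stay picklable.
def _plus_one(x):
    return x + 1


def _is_positive(x):
    return x > 0


@dataclass
class DataPipeInfo:
    # Hypothetical OpInfo-like record describing one DataPipe under test.
    name: str
    factory: Callable[[], Any]   # builds a fresh instance
    expected: List[Any]          # items the instance should yield
    has_len: bool = True         # whether __len__ is expected to work


DATAPIPE_INFOS = [
    DataPipeInfo(
        name="Mapper",
        factory=lambda: IterableWrapper([1, 2, 3]).map(_plus_one),
        expected=[2, 3, 4],
    ),
    DataPipeInfo(
        name="Filter",
        factory=lambda: IterableWrapper([-1, 0, 1]).filter(_is_positive),
        expected=[1],
        has_len=False,  # Filter cannot know its output length up front
    ),
]


@pytest.mark.parametrize("info", DATAPIPE_INFOS, ids=lambda i: i.name)
def test_functional_and_serializable(info):
    assert list(info.factory()) == info.expected
    # One shared serialization check instead of one per test file.
    assert list(pickle.loads(pickle.dumps(info.factory()))) == info.expected


@pytest.mark.parametrize("info", DATAPIPE_INFOS, ids=lambda i: i.name)
def test_len(info):
    dp = info.factory()
    if info.has_len:
        assert len(dp) == len(info.expected)
    else:
        with pytest.raises(TypeError):
            len(dp)
```

Adding a new DataPipe would then mean adding one `DataPipeInfo` entry rather than writing each check by hand, which is what makes the coverage auditable automatically.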