pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Why is a transpose needed when collating sequence data in the DataLoader? #129225

Open cuijianaaa opened 3 months ago

cuijianaaa commented 3 months ago

🐛 Describe the bug

When we use the default collate function in https://github.com/pytorch/pytorch/blob/217aac96d779841666527402fc113493b6bd6323/torch/utils/data/_utils/collate.py#L171: if we have two data samples (1, 2, 3) and (2, 3, 4) with batch_size = 2, the collated batch is [(1, 2), (2, 3), (3, 4)], so len(batch) is three even though we only have two samples. We normally treat the first dim as the batch dim, so this design seems strange to me. I think maybe we shouldn't transpose here, and should instead return the raw data when the sample is a sequence type.
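A minimal reproduction of the behaviour described above, using the public `torch.utils.data.default_collate` (which wraps that function):

```python
from torch.utils.data import default_collate

# Two samples, each a 3-element tuple.
batch = [(1, 2, 3), (2, 3, 4)]

collated = default_collate(batch)
print(len(collated))  # 3, not 2: one entry per field, not per sample
print(collated)       # [tensor([1, 2]), tensor([2, 3]), tensor([3, 4])]
```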

Versions

2.0.0

cc @andrewkho @gokulavasan @SsnL @VitalyFedyunin @dzhulgakov

andrewkho commented 3 months ago

Thanks for the report @cuijianaaa. I don't think this is a bug; this is expected behaviour. The length is 3 because each data sample has 3 elements. It's maybe easier to understand when the samples are dicts. Before collate:

[{"a": 1, "b": 2.5, "c": "string1"}, {"a": 2, "b": 4.5, "c": "string2"}]

After collate you get:

{"a": [1, 2], "b": [2.5, 4.5], "c": ["string1", "string2"]}

If a and b are features, and c is the target, this is the easiest format to pass to your model.
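For reference, this is what `torch.utils.data.default_collate` produces for that batch: numeric fields are stacked into tensors, while strings are kept as a list.

```python
from torch.utils.data import default_collate

batch = [
    {"a": 1, "b": 2.5, "c": "string1"},
    {"a": 2, "b": 4.5, "c": "string2"},
]

print(default_collate(batch))
# {'a': tensor([1, 2]),
#  'b': tensor([2.5000, 4.5000], dtype=torch.float64),
#  'c': ['string1', 'string2']}
```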

andrewkho commented 3 months ago

Keep in mind that you can pass a custom collate_fn to collate your data any way you like, if this doesn't suit your training setup.
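A minimal sketch of that (`identity_collate` is just an illustrative name): a collate_fn that returns the list of samples untouched, so no transposing or stacking happens.

```python
from torch.utils.data import DataLoader

# Illustrative custom collate_fn: return the batch of samples unchanged.
def identity_collate(batch):
    return batch

dataset = [(1, 2, 3), (2, 3, 4)]
loader = DataLoader(dataset, batch_size=2, collate_fn=identity_collate)

for batch in loader:
    print(batch)  # [(1, 2, 3), (2, 3, 4)], the two samples untouched
```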

cuijianaaa commented 3 months ago

> Thanks for the report @cuijianaaa. I don't think this is a bug; this is expected behaviour. The length is 3 because each data sample has 3 elements. It's maybe easier to understand when the samples are dicts. Before collate:
>
> [{"a": 1, "b": 2.5, "c": "string1"}, {"a": 2, "b": 4.5, "c": "string2"}]
>
> After collate you get:
>
> {"a": [1, 2], "b": [2.5, 4.5], "c": ["string1", "string2"]}
>
> If a and b are features, and c is the target, this is the easiest format to pass to your model.

Thank you very much for the detailed answer! But I don't think the list case is really analogous to the dict case. For example, I often keep sequence data where each sample is a sequence of strings. (Numbers don't have this problem, because they can be stacked into tensors; we often use lists precisely for strings, since strings cannot be stored as tensors.) Before collate:

[['file_name1', 'file_name2'], ['file_name3', 'file_name4']]

after collate:

[('file_name1', 'file_name3'), ('file_name2', 'file_name4')]
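This is easy to reproduce with the default collate:

```python
from torch.utils.data import default_collate

batch = [["file_name1", "file_name2"], ["file_name3", "file_name4"]]
print(default_collate(batch))
# [('file_name1', 'file_name3'), ('file_name2', 'file_name4')]
```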

I feel this is very strange behavior, and users wouldn't want to use it like this... The main issue is that the string case is quite special, and I have a few suggestions for improvement:

  1. Change how lists and tuples are collated, since most of the time we use lists and tuples to hold strings rather than numbers
  2. Support storing strings as tensors
  3. Support passing in a default_collate_fn_map with an entry for list
  4. Or, as you mentioned, use a custom collate_fn. In practice, though, users may want to keep most of the default behavior and only change how lists are handled, which currently requires copying a large amount of code from the default collate and is hard to maintain. So users may hope that the default behavior of lists and tuples supports most usage scenarios, or that strings can be stored as tensors (a sketch of option 3 follows below)
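
A rough sketch of option 3, assuming one is willing to rely on the private torch.utils.data._utils.collate module (so these names may change between releases; `collate_list_as_is` and `my_collate` are illustrative names): reuse the default type-to-handler map and override only the entry for list.

```python
from torch.utils.data._utils.collate import collate, default_collate_fn_map

# Illustrative handler: keep a batch of lists (e.g. lists of file names)
# exactly as it is, instead of transposing it.
def collate_list_as_is(batch, *, collate_fn_map=None):
    return list(batch)

# Copy the default type-to-handler map and override only the list entry;
# every other type is still collated exactly like the default.
custom_map = dict(default_collate_fn_map)
custom_map[list] = collate_list_as_is

def my_collate(batch):
    return collate(batch, collate_fn_map=custom_map)

batch = [["file_name1", "file_name2"], ["file_name3", "file_name4"]]
print(my_collate(batch))
# [['file_name1', 'file_name2'], ['file_name3', 'file_name4']]
```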