cuijianaaa opened this issue 3 months ago
Thanks for the report @cuijianaaa, I don't think this is a bug; this is expected behaviour. The length is 3 because each data sample has 3 elements. It may be easier to understand when the samples are dicts. Before collate:
[{"a": 1, "b": 2.5, "c": "string1"}, {"a": 2, "b": 4.5, "c": "string2"}]
After collate you get:
{"a": [1, 2], "b": [2.5, 4.5], "c": ["string1", "string2"]}
If `a` and `b` are features and `c` is the target, this is the easiest format to pass to your model.
Keep in mind that you can pass a custom `collate_fn` to collate your data any way you like if this doesn't suit your training setup.
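As a sketch of such a custom `collate_fn` (the name `my_collate` and its exact behaviour are illustrative, not part of torch): a function that merges dict samples key-wise like the default, but leaves sequence samples untransposed, could look like this:

```python
def my_collate(batch):
    """Hypothetical collate_fn (illustrative, not part of torch).

    Dict samples are merged key-wise, like the default behaviour,
    but list/tuple samples are returned untransposed.
    """
    first = batch[0]
    if isinstance(first, dict):
        # {"a": 1, ...}, {"a": 2, ...} -> {"a": [1, 2], ...}
        return {key: [sample[key] for sample in batch] for key in first}
    # Leave sequence samples (e.g. lists of file names) exactly as they came in.
    return list(batch)

# Dict samples: merged per key.
dict_batch = my_collate([{"a": 1, "b": 2.5, "c": "string1"},
                         {"a": 2, "b": 4.5, "c": "string2"}])
# {'a': [1, 2], 'b': [2.5, 4.5], 'c': ['string1', 'string2']}

# List-of-strings samples: no transposition.
str_batch = my_collate([["file_name1", "file_name2"],
                        ["file_name3", "file_name4"]])
# [['file_name1', 'file_name2'], ['file_name3', 'file_name4']]
```

In real use this function would be passed as `DataLoader(dataset, batch_size=2, collate_fn=my_collate)`; numeric fields could additionally be converted to tensors inside it.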
Thank you very much for the detailed answer! But I feel the list and dict situations are not really similar. For example, when I maintain a sequence of data, each sample is a sequence of strings. (With numbers there is no such problem, because numbers can be stacked into tensors; we often use lists to handle strings precisely because strings cannot be stored as tensors.) Before collate:
[['file_name1', 'file_name2'], ['file_name3', 'file_name4']]
after collate:
[('file_name1', 'file_name3'), ('file_name2', 'file_name4')]
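Internally, `default_collate` treats each sample as a sequence and transposes the batch with `zip`; the transposition above can be reproduced in plain Python (no torch needed for the string case):

```python
# What default_collate effectively does for sequence-type samples:
# zip(*batch) groups the i-th element of every sample together.
batch = [["file_name1", "file_name2"], ["file_name3", "file_name4"]]
transposed = list(zip(*batch))
print(transposed)
# [('file_name1', 'file_name3'), ('file_name2', 'file_name4')]
```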
I feel this is very strange behavior, and users wouldn't want to use it like this... The main issue is that the string case is quite special, and I have a few suggestions for improvement:
🐛 Describe the bug
When we use the default collate function in https://github.com/pytorch/pytorch/blob/217aac96d779841666527402fc113493b6bd6323/torch/utils/data/_utils/collate.py#L171, if we have two data samples (1, 2, 3) and (2, 3, 4) with batch_size = 2, then after collate the batch is [(1, 2), (2, 3), (3, 4)] and len(batch) is three... but we only have two samples. We always use the first dim as the batch dim, so this design is strange. I think maybe we needn't transpose here, and could instead return the raw data when the sample is a sequence type.
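The length mismatch comes from the transpose step alone; a minimal reproduction in plain Python (the real `default_collate` would additionally stack each numeric group into a tensor like `tensor([1, 2])`):

```python
batch = [(1, 2, 3), (2, 3, 4)]  # two samples, batch_size = 2

# default_collate transposes sequence samples before collating each group.
groups = list(zip(*batch))
print(groups)       # [(1, 2), (2, 3), (3, 4)]
print(len(groups))  # 3, not 2 -- the length is the sample length, not the batch size
```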
Versions
2.0.0
cc @andrewkho @gokulavasan @SsnL @VitalyFedyunin @dzhulgakov