pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text

Fields get misaligned in larger datasets #266

Open DenisPeskoff opened 6 years ago

DenisPeskoff commented 6 years ago

I have an issue that only occurs in larger datasets, so it seems to be a torchtext bug/"feature".

I have fields that are related and have to be of the same length, and everything is properly stored in torchtext on loading. The text is of length 8, and the corresponding numbers (confidences) for that text are also of length 8.

Simply iterating over my train_iter causes the correspondence to get messed up.

First time running:

    for batch in train_iter:
        text.shape
        confidence.shape

Batch 1: torch.Size([512, 8]) torch.Size([512, 8])

Batch 2: torch.Size([512, 10]) torch.Size([512, 10])

Second time running the exact same code:
torch.Size([512, 15]) torch.Size([512, 16])

torch.Size([512, 19]) torch.Size([512, 21])

Doing this on my identical-in-format but smaller val/dev sets does not cause this issue. I tried varying batch sizes, and the issue occurs with a batch size of 2 (no issue with just 1). The files aren't THAT big (55k lines for train, <20k for the other two).

Any idea as to how to resolve this issue?

DenisPeskoff commented 6 years ago

Here's the relevant Field code:

    for ex in json.load(f)['questions']:
        sentences = ex['sentences']
        confidences = ex['confidences']
        for i, s in enumerate(sentences):
            if len(confidences[i]) != len(s):
                raise ValueError(str(len(confidences[i])), str(len(s)), ex['qnum'])
            examples.append(Example.fromdict({
                'qnum': ex['qnum'],
                'sent': i,
                'text': s,
                'page': ex['page'],
                'confidence': confidences[i]

Confidences are stored in a custom field that handles padding for jagged batches:

    class FloatTensorField(RawField):
        def __init__(self):
            super().__init__()

        def preprocess(self, x):
            return [float(i) for i in x]

        def process(self, batch, **kwargs):
            longest_row = 0
            for row in batch:
                longest_row = max(longest_row, len(row))
            # pad each row in place to the longest row in the batch
            for row in batch:
                while len(row) < longest_row:
                    row.append(0)
            return torch.FloatTensor(batch)

And a line of the data look like this: {"questions": [{"qnum": 82005, "sentences": [["title", "character", "goes", "to", "rome", "", "after", "his", "presumed", "dead", "in", "the", "late", "", "while", "he", "main", "character", "imagine", "that", "he", "is", "the", "title", "monarchy", "and", "he's", "in", "rico", "y._v."], ["playing", "which", "family", "interrupt", "stage", "manager", "will", "rehearsing", "a", "plane", "is", "for", "ten", "point", "what", "a", "tally", "announcers", "fix", "character", "in", "search", "of", "an", "author"]], "confidences": [[0.7737882, 1, 0.5551082, 0.9999999, 0.5248953, 1, 1, 0.9753209, 0.9998516, 1, 0.9891226, 1, 1, 1, 0.999663, 0.9999864, 0.9977787, 1, 0.66838, 1, 1, 0.5352527, 0.8956456, 1, 0.8062871, 0.7695115, 0.5817718, 0.9975851, 0.9999992, 0.5294525], [0.8349707, 0.9976395, 1, 0.2920094, 0.8554499, 0.999689, 0.8970955, 1, 0.6108159, 0.4669197, 1, 0.901359, 0.7975582, 0.7070471, 1, 0.9982439, 0.9441589, 0.5693762, 0.711937, 0.5098747, 0.9606063, 1, 1, 1, 0.6871562]], "page": "Luigi_Pirandello"}

Sentences and confidences are always the same length.

DenisPeskoff commented 6 years ago

I found a patch and isolated the issue:

    class torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=None, shuffle=None, sort=None, sort_within_batch=None)

Setting shuffle=False causes it to stay aligned, but I'm still unsure as to WHY shuffling caused the misalignment.
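
If it helps, here's a minimal sketch of the workaround, assuming the legacy torchtext.data API from this era; train is a placeholder name for the dataset built from the JSON above, and batch_size matches the shapes reported earlier:

    from torchtext.data import BucketIterator  # legacy torchtext API

    # Disabling shuffling keeps the text and confidence fields aligned
    # across epochs (placeholder dataset/variable names).
    train_iter = BucketIterator(train, batch_size=512, shuffle=False)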

jekbradbury commented 6 years ago

This sounds like a particularly unfortunate bug, but I also can't figure out why shuffle would be causing misalignment. If there's any chance you have a dataset or partial dataset and code snippet that exhibits the problem, I'd like to see it. Does the bug still happen with Iterator rather than BucketIterator?

DenisPeskoff commented 6 years ago

Simply running "for batch in train_iter" twice is enough to throw the order off.

The first time after loading, all lengths line up: e.g. words are len 27 and confidences are len 27.

The second time, the assignment looks almost random: e.g. words might be len 141 while confidences are len 23.

Changing the sort_key from len(words) to len(confidences) makes a big difference (most batches are now aligned). However, a handful of batches are still randomly aligned (not just off by 1), so that didn't fully resolve the problem.

    def sort_key(example):
        return len(example.confidence)  # vs. return len(example.text)

Words are of a custom type that's a trivial variation on the default Vocab (random numbers for unseen items rather than 0s).

Confidences are of the custom FloatTensorField type, which is a FloatTensor plus padding.

    class FloatTensorField(RawField):
        def __init__(self):
            super().__init__()

        def preprocess(self, x):
            return [float(i) for i in x]

        def process(self, batch, **kwargs):
            longest_row = 0
            for row in batch:
                longest_row = max(longest_row, len(row))
            for row in batch:
                while len(row) < longest_row:
                    row.append(0)
            return torch.FloatTensor(batch)

I only tried BucketIterator. This is resolved for my purposes, but I could help debug further if you have a hypothesis. It's probably some combination of padding + shuffling with a numerical field aligned to a text field.

nbrgr commented 4 years ago

I had the same problem with my batch function:

def batch_fun(batch):
    max_len = max([len(x) for x in batch])

    for example in batch:
        example += [0.0] * (max_len - len(example))

    return batch

Iterator and BucketIterator don't make copies of the data before calling the batch preprocessing function you give to RawField, so padding your examples in place like this modifies the examples stored in your dataset. On the next epoch it takes the already-padded examples from the previous epoch, re-batches them, and pads them again. If you have multiple RawFields, this can leave fields within the same batch with different lengths.
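
To make that concrete, here's a minimal standalone sketch of the aliasing (plain Python, not torchtext internals): the rows handed to the batch function are the same list objects stored in the dataset, so in-place padding survives into the next epoch. Once shuffling regroups the examples, those stale padded lengths no longer track the text field, which is rebuilt from the raw tokens each epoch.

    # The "dataset" rows below are the very list objects the batch function
    # receives, so in-place padding changes the dataset itself.
    dataset = [[1.0], [1.0, 2.0, 3.0]]

    def batch_fun(batch):
        max_len = max(len(x) for x in batch)
        for example in batch:
            example += [0.0] * (max_len - len(example))  # mutates the stored row
        return batch

    batch_fun(dataset)   # "epoch 1"
    print(dataset[0])    # [1.0, 0.0, 0.0] -- the stored example is now padded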

Changing my batch function to this:

def batch_fun(batch):
    max_len = max([len(x) for x in batch])
    batch_copy = []
    for example in batch:
        batch_copy.append(example[:] + [0.0] * (max_len - len(example)))

    return batch_copy

seemed to fix my issue.
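
Applying the same copy-before-pad idea to the FloatTensorField.process from earlier in the thread would look roughly like this (an untested sketch; the rest of the field stays unchanged):

    def process(self, batch, **kwargs):
        # Pad into a fresh list of lists so the Example objects in the
        # dataset keep their original, unpadded confidence lists.
        longest_row = max(len(row) for row in batch)
        padded = [list(row) + [0.0] * (longest_row - len(row)) for row in batch]
        return torch.FloatTensor(padded)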