google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Fine-tuning takes a long time to save checkpoints #212

Open wayfarerjing opened 5 years ago

wayfarerjing commented 5 years ago

I'm fine-tuning BERT on the IMDB movie review dataset with 60,000+ training reviews. When running on a GPU (K80) it takes a really long time to save checkpoints (about 2 hours and it is only halfway through so far), while running on the MRPC dataset (5,000 pairs of texts) takes only 9 minutes. Is this slow speed expected?
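
For reference, the checkpoint-writing frequency itself is configurable. Below is a minimal sketch of the relevant Estimator configuration with illustrative values; it is not the exact code in run_classifier.py, which exposes the setting via the --save_checkpoints_steps flag.

import tensorflow as tf

# Saving less often and keeping fewer checkpoints reduces the time spent writing
# the large (roughly 1.3 GB) training checkpoints BERT-Base produces while
# fine-tuning. The directory and both values below are illustrative assumptions.
run_config = tf.estimator.RunConfig(
    model_dir="/tmp/imdb_output",
    save_checkpoints_steps=5000,   # run_classifier.py defaults to 1000
    keep_checkpoint_max=2,
)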

DragonAndSky commented 5 years ago

@wayfarerjing I had the same problem, were you able to solve this issue?

wayfarerjing commented 5 years ago

@DragonAndSky I had the same problem, were you able to solve this issue?

No I didn't manage to solve it. I think maybe it is expected.

Jinxi2 commented 5 years ago

Maybe the time cost is not mainly from saving checkpoints; it may come from prediction and evaluation.

PaulZhangIsing commented 5 years ago

A similar situation here. I have been running run_classifier on the MRPC dataset using the code provided in the README, but it seems to take longer than expected.

jageshmaharjan commented 5 years ago

Exactly, the same situation. I thought it was expected, so I never opened an issue ticket.

fciannella commented 5 years ago

What level of accuracy do you get? I am trying with aclImdb (25K documents for training) and I am getting very bad results. Could you share the configuration and the BERT base model you're using?

PaulZhangIsing commented 5 years ago

What level of accuracy do you get? I am trying with aclImdb (25K documents for training) and I am getting very bad results. Could you share the configuration and the BERT base model you're using?

as bad as 0.413....

fciannella commented 5 years ago

I am getting 0.5, just chance. I am not sure what I am doing wrong. I am using 256 as max seq length and the rest of the parameters are all default. I have also tried 512 max seq length on TPU, but same results. I will do some debugging on the data processor.

PaulZhangIsing commented 5 years ago

I am getting 0.5, just chance. I am not sure what I am doing wrong. I am using 256 as max seq length and the rest of the parameters are all default. I have also tried 512 max seq length on TPU, but same results. I will do some debugging on the data processor.

Yup, the data processor seems to be error-prone, but after some debugging I ran it again and got 1.0... which seems sort of impossible.

Besides, did you try the do_predict function? I don't quite understand the output: it contains two columns that sum to 1.

Since I am doing binary classification, does that represent the probabilities of the two classes respectively?

fciannella commented 5 years ago

Can you share your ImdbProcessor and the command you use to run the classifier when you get 1 as accuracy? I haven't checked the predict function, but if the output is two columns that sum to one, that should be the probability of each class.
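
For reference, here is a minimal sketch of reading those predictions. It assumes run_classifier.py was run with --do_predict=true, which writes a test_results.tsv into the output directory with one row per example and one tab-separated probability per class (in the order returned by the processor's get_labels()); the path below is hypothetical.

import csv

# Hypothetical output path; run_classifier.py writes this file into --output_dir.
with open("/tmp/imdb_output/test_results.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        probs = [float(p) for p in row]        # e.g. [P(label_0), P(label_1)]
        predicted = probs.index(max(probs))    # argmax over the columns = predicted class
        print(predicted, probs)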

PaulZhangIsing commented 5 years ago

I was doing classification on another dataset, not IMDB, but it should be more or less similar:

# Note: this assumes `import re` has been added to run_classifier.py's imports and
# that the text fields in the CSV files contain no commas (train.csv is split on
# every ',').
class AnyhowProcessor(DataProcessor):
    """Processor for the project's dataset."""

    def __init__(self):
        self.language = "en"

    def get_labels(self):
        # Assumed binary labels; adjust to match the dataset.
        return ["0", "1"]

    def get_train_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'train.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'train-%d' % index
            # train.csv rows are: label, text_a, text_b
            split_line = line.strip().split(',')
            text_a = tokenization.convert_to_unicode(split_line[1])
            text_b = tokenization.convert_to_unicode(split_line[2])
            label = split_line[0]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    def get_dev_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'val.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'dev-%d' % index
            # val.csv rows are: label, text_a (split only on the first comma)
            split_line = line.strip().split(',', 1)
            text_a = tokenization.convert_to_unicode(split_line[1])
            label = split_line[0]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=None, label=label))
        return examples

    def get_test_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'test.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'test-%d' % index
            split_line = line.strip().split(',', 1)
            text_a = tokenization.convert_to_unicode(split_line[1])
            # strip any stray quoting or punctuation around the label
            label = re.sub(r'[^\w]', '', split_line[0])
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=None, label=label))
        return examples

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue  # skip the header row
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(line[3])
            text_b = tokenization.convert_to_unicode(line[4])
            if set_type == "test":
                label = "0"
            else:
                label = tokenization.convert_to_unicode(line[0])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
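
For anyone reproducing this: a custom processor like the one above also has to be wired into run_classifier.py and selected with --task_name. Here is a minimal sketch of the registration step in main(); the first four entries already exist in the script, and "anyhow" is just a hypothetical task name for the processor defined above (it would be passed as --task_name=anyhow).

processors = {
    "cola": ColaProcessor,   # existing entries in run_classifier.py
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
    "anyhow": AnyhowProcessor,  # hypothetical key for the custom processor above
}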

fciannella commented 5 years ago

What's the dataset you're working on?

PaulZhangIsing commented 5 years ago

I got it from the Kaggle Toxic Comment dataset.


hsm207 commented 5 years ago

Can you share your ImdbProcessor and the command you use to run the classifier when you get 1 as accuracy? I haven't checked the predict function, but if the output is two columns that sum to one, that should be the probability of each class.

@fciannel I am also getting the same results as you. Here's my Imdb Processor:

class ImdbProcessor(ColaProcessor):
    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            # skip headers
            if i == 0:
                continue

            guid = "%s-%s" % (set_type, line[0])
            if set_type == 'test':
                text_a = tokenization.convert_to_unicode(line[2])
                # the test set we pass to this processor will also be labeled
                label = tokenization.convert_to_unicode(line[1])
            else:
                text_a = tokenization.convert_to_unicode(line[2])
                label = tokenization.convert_to_unicode(line[1])

            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
            )

        return examples
I doubt it is a problem with the data processor since I used a similar data processor and got satisfactory results.

fciannella commented 5 years ago

@hsm207 I have been looking into the IMDB dataset. I thought it might be the length of the examples. For instance, when I tried the Quora Insincere Questions dataset I got great results, and there each example is a single, very short sentence. In IMDB the reviews are quite long. I did try training with sequences truncated to 256 tokens, though, and got the same result.

What are the datasets that were successful for you?

hsm207 commented 5 years ago

@fciannel The dataset that was successful for me is a proprietary dataset so I can't talk much about it, other than saying it is a binary text classification task.

Anyway, I think I have found the source of the poor performance. The imdb dataset I am working with has all the examples labeled '1' appear one after another followed by all the examples labeled '0'. Although the input_fn has a shuffling step during training, its buffer size is only 100, so I suspect the model is being fed all examples labeled '1' followed by all examples labeled '0'. My solution was to have a shuffle step in the data processor. Here's my updated imdb processor:

class ImdbProcessor(ColaProcessor):
    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            # skip headers
            if i == 0:
                continue

            guid = "%s-%s" % (set_type, line[0])
            if set_type == 'test':
                text_a = tokenization.convert_to_unicode(line[2])
                # the test set we pass to this processor will also be labeled
                label = tokenization.convert_to_unicode(line[1])
            else:
                text_a = tokenization.convert_to_unicode(line[2])
                label = tokenization.convert_to_unicode(line[1])

            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
            )

        # the training set is sorted such that all the '1's (labels) are arranged one after another followed 
        # by all the '0's.
        # since the buffer size for the shuffle step is only 100 and batch size is relatively small e.g. 32, the model
        # ends up being fed with all '1' examples followed by all '0' examples.
        # shuffling at this stage ensures that each batch has a mix of '1's and '0's.
        if set_type == 'train':
            random.shuffle(examples)
        return examples

Please let me know how it goes for you.
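
An alternative for anyone who would rather not touch the processor: the small shuffle buffer lives in file_based_input_fn_builder in run_classifier.py, so enlarging it there should have the same effect. Below is a rough sketch of the relevant excerpt (not a full function); the exact buffer value is an assumption and should be at least the number of training examples so the label-sorted data gets fully mixed.

# Inside file_based_input_fn_builder() in run_classifier.py:
if is_training:
    d = d.repeat()
    d = d.shuffle(buffer_size=25000)  # the script's default buffer_size is 100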

fciannella commented 5 years ago

@hsm207 you nailed it :) thanks very much.

I get this at a first attempt, with default parameters and 128 max seq length:

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.8435162
INFO:tensorflow: eval_loss = 0.4801063
INFO:tensorflow: global_step = 2343
INFO:tensorflow: loss = 0.47997266

but I can bring it up with longer sentences. Will try on TPU later and let you know.

wayfarerjing commented 5 years ago

@hsm207 you nailed it :) thanks very much.

I get this at a first attempt, with default parameters and 128 max seq length:

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.8435162
INFO:tensorflow: eval_loss = 0.4801063
INFO:tensorflow: global_step = 2343
INFO:tensorflow: loss = 0.47997266

but I can bring it up with longer sentences. Will try on TPU later and let you know.

Is this the result for the IMDB dataset after you applied shuffling? How much did it improve?

fciannella commented 5 years ago

This is what I get if I use 512 as the max sequence length, training and evaluating on TPU with a batch size of 32 for training and 8 for prediction:

Eval results
eval_accuracy = 0.9312
eval_loss = 0.5297534
global_step = 4000
loss = 0.4339881

I think I can still bring it up if I do a better job of cleaning the dataset and tuning the hyperparameters; 0.93 is not an impressive result on this dataset.

@wayfarerjing yes, this is after applying the random shuffling inside the data processor as suggested by @hsm207. It went from chance to 0.93.

hsm207 commented 5 years ago

@wayfarerjing @fciannel

This is the best result I got thus far:

Hyperparameters:
  max sequence length: 512
  batch size: 8
  learning rate: 3e-05
  number of epochs: 3

Results:
  eval loss: 0.3292
  eval accuracy: 0.9407

Do you think it's worthwhile fine-tuning with the same hyperparameters but more epochs, e.g. up to 10?

The state-of-the-art right now is 95.4 using ULMFiT.

PaulZhangIsing commented 5 years ago

@hsm207 you nailed it :) thanks very much.

I get this at a first attempt, with default parameters and 128 max seq length:

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.8435162
INFO:tensorflow: eval_loss = 0.4801063
INFO:tensorflow: global_step = 2343
INFO:tensorflow: loss = 0.47997266

but I can bring it up with longer sentences. Will try on TPU later and let you know.

I shall work on the IMDB dataset. However, my own dataset is private, and I am sorry that I cannot release it to the public yet.

This is the result for my dataset; however, once I run prediction, the output looks heavily biased.

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 1.0
INFO:tensorflow: eval_loss = 3.9339047e-06
INFO:tensorflow: global_step = 4563

like this:

0.6363035 0.36369655
0.6363035 0.36369655
0.6363035 0.36369655
0.6363035 0.36369655
0.6363035 0.36369655
0.6363035 0.36369655
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08
0.9999999 6.4646684E-08

INFO:tensorflow: loss = 3.9339047e-06