wayfarerjing opened this issue 5 years ago:

I'm running BERT on the IMDB movie review dataset with 60000+ training reviews. When running on a GPU (K80) it takes a really long time to save checkpoints (~2 hours in, and it is only halfway so far), while running on the MRPC dataset (5000 pairs of texts) takes only 9 minutes. Is this slow speed expected?
@wayfarerjing I had the same problem. Were you able to solve this issue?
@DragonAndSky I had the same problem. Were you able to solve this issue?
No, I didn't manage to solve it. I think maybe it's expected.
Maybe the time cost is not mainly from saving checkpoints; it may come from prediction and evaluation.
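If checkpoint writes do turn out to be the bottleneck, one knob to try is saving less often. A minimal sketch using the TF 1.x Estimator API (run_classifier.py exposes the same setting via its --save_checkpoints_steps flag; the model_dir path here is just a placeholder):

```python
import tensorflow as tf

# Write a checkpoint every 10000 steps instead of the default 1000,
# and keep only the most recent checkpoint on disk.
run_config = tf.estimator.RunConfig(
    model_dir="/tmp/imdb_output",   # placeholder output directory
    save_checkpoints_steps=10000,
    keep_checkpoint_max=1)
```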
Similar situation here. I have been running run_classifier on the MRPC dataset using the code provided in the README, but it seems to take longer than expected.
Exactly the same situation. I thought it was expected, so I never opened an issue.
What level of accuracy do you get? I am trying with aclImdb (25K documents for training) and I am getting very bad results. Could you share the configuration and the BERT base model you're using?
as bad as 0.413....
I am getting 0.5, just chance. I am not sure what I am doing wrong. I am using 256 as max seq length and the rest of the parameters are all default. I have also tried 512 max seq length on TPU, but same results. I will do some debugging on the data processor.
Yup, the data processor seems to be error-prone, but I did some debugging and ran it again, and got 1.0... which seems sort of impossible.
Besides, did you try the do_predict function? I don't quite understand its output: it contains two columns that sum to 1.
Since I am doing binary classification, does that mean they are the probabilities of the two classes, respectively?
Can you share your ImdbProcessor and the command you use to run the classifier when you get 1.0 as accuracy? I haven't checked the predict function, but if the output is two columns that sum to one, that should be the probability of each class.
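The two columns are the softmax over the two class logits, which is why each row sums to 1; the column order follows the label list returned by the processor's get_labels(). A minimal sketch of the mapping:

```python
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Two-class logits -> one probability per class, summing to 1.
print(softmax(np.array([2.0, -1.0])))  # -> [0.95257413 0.04742587]
```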
I was doing classification for another dataset, instead of IMDB, but it should be more or less similar:

```python
# Assumes the surrounding run_classifier.py context: os, re, tokenization,
# DataProcessor and InputExample are already available there.
class AnyhowProcessor(DataProcessor):
    """Processor for the project."""

    def __init__(self):
        self.language = "en"

    def get_train_examples(self, data_dir):
        file_path = os.path.join(data_dir, 'train.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'train-%d' % index
            split_line = line.strip().split(',')
            text_a = tokenization.convert_to_unicode(split_line[1])
            text_b = tokenization.convert_to_unicode(split_line[2])
            label = split_line[0]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    def get_dev_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'val.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'dev-%d' % index  # was 'train-%d'; guid should name the split
            split_line = line.strip().split(',', 1)
            text_a = tokenization.convert_to_unicode(split_line[1])
            label = split_line[0]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=None, label=label))
        return examples

    def get_test_examples(self, data_dir):
        """See base class."""
        file_path = os.path.join(data_dir, 'test.csv')
        with open(file_path, 'r') as f:
            reader = f.readlines()
        examples = []
        for index, line in enumerate(reader):
            guid = 'test-%d' % index  # was 'train-%d'
            split_line = line.strip().split(',', 1)
            text_a = tokenization.convert_to_unicode(split_line[1])
            # strip stray punctuation from the label column
            label = re.sub(r'[^\w]', '', split_line[0])
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=None, label=label))
        return examples

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:  # skip the header row
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(line[3])
            text_b = tokenization.convert_to_unicode(line[4])
            if set_type == "test":
                label = "0"
            else:
                label = tokenization.convert_to_unicode(line[0])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
```
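One fragile spot in the processor above: splitting raw lines with split(',') breaks as soon as the text itself contains a comma. A safer sketch using Python's csv module (column order assumed to match the original layout, label first, then the text fields):

```python
import csv

def read_csv_examples(file_path):
    """Yield (label, text_a, text_b) tuples, handling quoted commas."""
    with open(file_path, newline='') as f:
        for row in csv.reader(f):
            yield row[0], row[1], row[2]
```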
What's the dataset you're working on?
I got it from the Kaggle toxic comment dataset.
@fciannel I am also getting the same results as you. Here's my Imdb Processor:
```python
class ImdbProcessor(ColaProcessor):

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            # skip headers
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            if set_type == 'test':
                text_a = tokenization.convert_to_unicode(line[2])
                # the test set we pass to this processor will also be labeled
                label = tokenization.convert_to_unicode(line[1])
            else:
                text_a = tokenization.convert_to_unicode(line[2])
                label = tokenization.convert_to_unicode(line[1])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
            )
        return examples
```
I doubt it is a problem with the data processor since I used a similar data processor and got satisfactory results.
@hsm207 I have been looking into the IMDB dataset. I thought it might be the length of the examples. For instance, when I tried the Quora insincere questions dataset I got great results, and in that dataset each example is a single, very short sentence. In IMDB the reviews are quite long. I tried training on sequences truncated to 256 tokens, though, but got the same result.
What are the datasets that were successful for you?
@fciannel The dataset that was successful for me is a proprietary dataset so I can't talk much about it, other than saying it is a binary text classification task.
Anyway, I think I have found the source of the poor performance. The IMDB dataset I am working with has all the examples labeled '1' appearing one after another, followed by all the examples labeled '0'. Although the input_fn has a shuffling step during training, its buffer size is only 100, so I suspect the model is being fed all the '1' examples followed by all the '0' examples. My solution was to add a shuffle step in the data processor. Here's my updated ImdbProcessor:
```python
import random  # needed for the shuffle below

class ImdbProcessor(ColaProcessor):

    def _create_examples(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            # skip headers
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            if set_type == 'test':
                text_a = tokenization.convert_to_unicode(line[2])
                # the test set we pass to this processor will also be labeled
                label = tokenization.convert_to_unicode(line[1])
            else:
                text_a = tokenization.convert_to_unicode(line[2])
                label = tokenization.convert_to_unicode(line[1])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
            )
        # The training set is sorted such that all the '1' labels come one
        # after another, followed by all the '0's. Since the buffer size for
        # the shuffle step is only 100 and the batch size is relatively small
        # (e.g. 32), the model ends up being fed all '1' examples followed by
        # all '0' examples. Shuffling at this stage ensures that each batch
        # has a mix of '1's and '0's.
        if set_type == 'train':
            random.shuffle(examples)
        return examples
```
Please let me know how it goes for you.
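To see why the small shuffle buffer matters, consider a label-sorted stream: with a buffer of 100, every draw during the first half of training can only come from the next 100 elements, which all carry the same label. A quick sketch of the setup (the 12500/12500 counts assume the standard IMDB train split):

```python
import tensorflow as tf

# 12500 positives followed by 12500 negatives, as in a label-sorted IMDB set.
labels = [1] * 12500 + [0] * 12500

# A buffer of 100 can only mix within a 100-element window, so early batches
# are effectively all-positive; a buffer >= the dataset size would mix fully.
ds = tf.data.Dataset.from_tensor_slices(labels).shuffle(buffer_size=100)
```

Shuffling the examples once in the processor, as above, sidesteps this regardless of the input pipeline's buffer size.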
@hsm207 you nailed it :) thanks very much.
I get this at a first attempt, with default parameters and 128 max seq length:
INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 0.8435162
INFO:tensorflow: eval_loss = 0.4801063
INFO:tensorflow: global_step = 2343
INFO:tensorflow: loss = 0.47997266
but I can bring it up with longer sentences. Will try on TPU later and let you know.
Is this the result for the IMDB dataset after you applied shuffling? How much did it improve?
This is what I get if I use 512 as the max sequence length, training and evaluating on TPU with a batch size of 32 for training and 8 for prediction:

Eval results
eval_accuracy = 0.9312
eval_loss = 0.5297534
global_step = 4000
loss = 0.4339881

I think I can still bring it up if I do a better job of cleaning the dataset and tuning the hyperparameters. That result is not yet impressive for this dataset.
@wayfarerjing yes, this is after applying the random shuffling inside the data processor, as suggested by @hsm207. Accuracy went from chance to 0.93.
@wayfarerjing @fciannel
This is the best result I got thus far:
| Hyperparameters | |
|---|---|
| max sequence length | 512 |
| batch size | 8 |
| learning rate | 3.00E-05 |
| number of epochs | 3 |

| Results | |
|---|---|
| eval loss | 0.3292 |
| eval accuracy | 0.9407 |
Do you think it's worthwhile fine-tuning with the same hyperparameters but for more epochs, e.g. up to 10?
The state of the art right now is 95.4, using ULMFiT.
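If you do try more epochs, the change is a single flag. For context, this sketch shows how the stock run_classifier.py (assuming the unmodified script) turns epochs into training steps; 25000 is the usual IMDB train-split size:

```python
# Epochs -> steps, as computed in run_classifier.py.
num_train_examples = 25000
train_batch_size = 8
num_train_epochs = 10.0   # up from 3

num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * 0.1)  # default warmup_proportion
print(num_train_steps, num_warmup_steps)       # 31250 3125
```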
I shall work on the IMDB dataset. As for my own dataset, I'm really sorry, but I can't release it to the public yet.
This is the result for my dataset; however, once I run prediction, the output is biased:

INFO:tensorflow: Eval results
INFO:tensorflow: eval_accuracy = 1.0
INFO:tensorflow: eval_loss = 3.9339047e-06
INFO:tensorflow: global_step = 4563

The predictions look like this:
| | |
|---|---|
| 0.6363035 | 0.36369655 |
| 0.6363035 | 0.36369655 |
| 0.6363035 | 0.36369655 |
| 0.9999999 | 6.4646684E-08 |
| 0.9999999 | 6.4646684E-08 |
| 0.9999999 | 6.4646684E-08 |

(the first 6 rows are all 0.6363035 | 0.36369655, and the remaining 46 rows are all 0.9999999 | 6.4646684E-08)

INFO:tensorflow: loss = 3.9339047e-06
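Output that is nearly identical across all examples usually means the model collapsed to one class, consistent with the sorted-data issue discussed above. One quick sanity check is to turn the probabilities into hard predictions and compare them against the true labels. A sketch, assuming the test_results.tsv file that run_classifier.py's do_predict writes, plus a hypothetical test_labels.txt file:

```python
import numpy as np

# Each row of test_results.tsv holds one tab-separated probability per class,
# in the order given by the processor's get_labels().
probs = np.loadtxt("test_results.tsv")
preds = probs.argmax(axis=1)

# Hypothetical file with one integer label per line for the same examples.
true_labels = np.loadtxt("test_labels.txt", dtype=int)
print("accuracy:", (preds == true_labels).mean())

# Class-balance check: if nearly everything lands in one class, the batches
# the model saw during training were probably not shuffled.
print("predicted class counts:", np.bincount(preds))
```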