Closed: alphanlp closed this issue 5 years ago
Why should it be limited to half of 512?
Because during training we have sentence embeddings 0 and 1, but in a single-sentence classification task we only use embedding 0. Does this have a bad influence?
You can just set the whole sequence to sentence 0. Create a DataProcessor subclass for your task and set the whole input sequence to text_a, for example:
```python
class MyProcessor(DataProcessor):
    # some other methods here

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[1]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
```
Notice text_b=None.
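To see why text_b=None gives the whole input segment 0, here is a minimal sketch of how BERT-style token and segment id lists are built. The function name make_bert_inputs is illustrative, not from the repo:

```python
# Hypothetical sketch (make_bert_inputs is not a real repo function):
# with a single sentence, every position gets segment id 0; a second
# sentence (text_b) would get segment id 1.
def make_bert_inputs(tokens_a, tokens_b=None):
    """Build BERT-style token and segment id lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B -> segment 1
    return tokens, segment_ids

tokens, seg = make_bert_inputs(["hello", "world"])
# seg is [0, 0, 0, 0]: a single-sentence input never touches segment 1
```

So setting text_a to the full input and leaving text_b=None is exactly the single-sentence case the model saw during pre-training for segment 0.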
What should I do if I have not just a sentence, but a whole text?
I don't clearly understand how to extend BertForSequenceClassification
with my own dataset for training/evaluating.
I have a dataset consisting of text/label pairs, where text can have multiple sentences.
Just send in the whole text as one "sentence"; the limit on the sequence length that can be sent to BERT at once is 512 tokens.
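In practice that means a long text has to be truncated so that the final sequence, including the two special tokens, fits the limit. A hedged sketch (truncate_for_bert is an illustrative name, not a repo function):

```python
# Sketch: truncate a WordPiece token list so the final
# [CLS] ... [SEP] sequence fits BERT's 512-token limit.
MAX_SEQ_LENGTH = 512

def truncate_for_bert(wordpiece_tokens, max_seq_length=MAX_SEQ_LENGTH):
    # Reserve 2 positions for the special [CLS] and [SEP] tokens.
    body = wordpiece_tokens[: max_seq_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]

seq = truncate_for_bert(["tok"] * 1000)
# len(seq) == 512: anything past position 510 of the body is dropped
```

Note that simple head truncation discards the end of the document; other strategies (head+tail, sliding windows) exist, but this is the simplest fit.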
Ok, thanks. One more question related to classification. BERT tokenizes my sentences rather strangely:
04/27/2019 16:08:32 - INFO - main - tokens: [CLS] @ bra ##yy ##yy ##ant Так акт ##иви ##ровала ##сь новая карта , ст ##ара ##я и была не ##ак ##тив ##на . [SEP]
Why are more than half of the words separated with ##? I mean, these words are in Russian, and many of them are split into several parts with ##, even though each is a single word. Should this be fixed during training?
That's WordPiece tokenization; it's a way to match subwords when an out-of-vocabulary word is encountered. It's explained in the BERT paper, with references. It's working as it should.
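The idea can be shown with a toy greedy longest-match-first sketch of WordPiece. The vocabulary here is made up for illustration; the real BERT vocab has ~30k entries:

```python
# Toy WordPiece: greedily match the longest vocab entry from the left;
# non-initial pieces carry the "##" continuation prefix.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark continuation of a word
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate until it is in the vocab
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
wordpiece("unaffable", vocab)  # -> ['un', '##aff', '##able']
```

Russian words get split more often simply because the multilingual vocab contains fewer whole Russian words, so the tokenizer falls back to subword pieces; the model is trained on exactly these pieces, so nothing needs fixing.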
Ok, thank you so much.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, if I have a single-sentence classification task, should the max length of the sentence be limited to half of 512, that is to say 256?