Closed: alphanlp closed this issue 5 years ago
Why should it be limited to half of 512?
Because during training we have sentence embeddings 0 and 1, but in a single-sentence classification task we only use embedding 0. Does this have a bad influence?
You can just set the whole sequence to sentence 0. Create a DataProcessor subclass for your task and set the whole input sequence to text_a, for example:
```python
class MyProcessor(DataProcessor):
    # some other methods here

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[1]
            label = line[0]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
```
Notice text_b=None.
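To see why text_b=None gives the whole input segment 0, here is a minimal sketch of how BERT-style token and segment id lists are built. The function name make_bert_inputs is illustrative, not from the repo:

```python
# Hypothetical sketch (make_bert_inputs is not a real repo function):
# with a single sentence, every position gets segment id 0; a second
# sentence (text_b) would get segment id 1.
def make_bert_inputs(tokens_a, tokens_b=None):
    """Build BERT-style token and segment id lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B -> segment 1
    return tokens, segment_ids

tokens, seg = make_bert_inputs(["hello", "world"])
# seg is [0, 0, 0, 0]: a single-sentence input never touches segment 1
```

So setting text_a to the full input and leaving text_b=None is exactly the single-sentence case the model saw during pre-training for segment 0.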
What should I do if I have not just a sentence, but a whole text?
I don't clearly understand how to extend BertForSequenceClassification
with my own dataset for training/evaluating.
I have a dataset consisting of text/label pairs, where text can have multiple sentences.
Just send in the whole text as one "sentence"; the limit on the sequence length that can be sent to BERT at once is 512 tokens.
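In practice that means a long text has to be truncated so that the final sequence, including the two special tokens, fits the limit. A hedged sketch (truncate_for_bert is an illustrative name, not a repo function):

```python
# Sketch: truncate a WordPiece token list so the final
# [CLS] ... [SEP] sequence fits BERT's 512-token limit.
MAX_SEQ_LENGTH = 512

def truncate_for_bert(wordpiece_tokens, max_seq_length=MAX_SEQ_LENGTH):
    # Reserve 2 positions for the special [CLS] and [SEP] tokens.
    body = wordpiece_tokens[: max_seq_length - 2]
    return ["[CLS]"] + body + ["[SEP]"]

seq = truncate_for_bert(["tok"] * 1000)
# len(seq) == 512: anything past position 510 of the body is dropped
```

Note that simple head truncation discards the end of the document; other strategies (head+tail, sliding windows) exist, but this is the simplest fit.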
Ok, thanks. One more question related to classification. BERT tokenizes my sentences rather strangely:
04/27/2019 16:08:32 - INFO - main - tokens: [CLS] @ bra ##yy ##yy ##ant Так акт ##иви ##ровала ##сь новая карта , ст ##ара ##я и была не ##ак ##тив ##на . [SEP]
Why are more than half of the words separated with ##? I mean, these words are in Russian, and many of them are split into several parts with ##, even though each is a single word. Should this be fixed during training?
That's WordPiece tokenization; it's a way to match subwords when an out-of-vocabulary word is encountered. It's explained in the BERT paper, with references. It's working as it should.
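The idea can be shown with a toy greedy longest-match-first sketch of WordPiece. The vocabulary here is made up for illustration; the real BERT vocab has ~30k entries:

```python
# Toy WordPiece: greedily match the longest vocab entry from the left;
# non-initial pieces carry the "##" continuation prefix.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark continuation of a word
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate until it is in the vocab
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
wordpiece("unaffable", vocab)  # -> ['un', '##aff', '##able']
```

Russian words get split more often simply because the multilingual vocab contains fewer whole Russian words, so the tokenizer falls back to subword pieces; the model is trained on exactly these pieces, so nothing needs fixing.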
Ok, thank you so much.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, if I have a single-sentence classification task, should the max length of the sentence be limited to half of 512, that is to say 256?