tagoyal / dae-factuality


How to process data with input length longer than 128? #1

Closed zide05 closed 3 years ago

zide05 commented 3 years ago

Thanks for the quick release of the code and checkpoints. Still, I have run into one problem when using your trained checkpoint to test my data.

I have noticed that the default maximum input length is set to 128. When I test a sample whose input is longer than 128 tokens, the data preprocessing code (in utils.py) simply skips it rather than truncating it:

        if len(tokens_input) > max_length:
            rejected_ex += 1
            # tokens_input = tokens_input[:max_length]
            continue

I tried commenting out that line and changing the code to:

        if len(tokens_input) > max_length:
            # rejected_ex += 1
            tokens_input = tokens_input[:max_length]
            # continue

but got an assertion error about mismatched lengths.
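My guess is that the mismatch comes from truncating only the input tokens while other per-token sequences built from them keep their original length. A minimal sketch of the kind of consistent truncation I mean (token_tags is just a placeholder name here, not an actual variable in utils.py):

    def truncate_example(tokens_input, token_tags, max_length):
        # Sketch only: the tokens and any parallel per-token annotations
        # must be cut to the same length, otherwise a later
        # assert len(tokens_input) == len(token_tags) style check fails.
        if len(tokens_input) > max_length:
            tokens_input = tokens_input[:max_length]
            token_tags = token_tags[:max_length]
        return tokens_input, token_tags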

However, in factuality checking for summarization, the input documents tend to be much longer than 128 tokens. How can I fix this problem?

tagoyal commented 3 years ago

You could pass a max_length suited to your dataset into the function.

Note, however, that the available models were trained on input-output pairs that were shorter than 128 tokens after concatenation. We have not tested the model on longer source articles and cannot comment on how well it would work on such test sets.
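As a rough way to choose that value, a check like the sketch below estimates the max_length needed to cover most of your (source, summary) pairs. It is not part of this repo, and whitespace splitting only approximates the model's subword tokenizer:

    # Sketch: estimate a max_length covering a given fraction of pairs.
    def pick_max_length(pairs, coverage=0.99):
        lengths = sorted(len(src.split()) + len(summ.split())
                         for src, summ in pairs)
        idx = min(int(coverage * len(lengths)), len(lengths) - 1)
        return lengths[idx]

    pairs = [("source document text ...", "summary text ...")]
    print(pick_max_length(pairs))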

zide05 commented 3 years ago

The average token length of the source documents can be even longer than 512. If examples longer than max_length are skipped rather than truncated, most of the test set will be skipped. Why did you choose to skip long-document examples instead of truncating them?

tagoyal commented 3 years ago

This wasn't a problem with the datasets we considered; nearly all examples had sentence pairs shorter than 128 tokens. In this paper, we didn't consider document-sized samples.