mlpen / Nystromformer

Apache License 2.0

Preprocessing datasets #9

Closed thomasw21 closed 3 years ago

thomasw21 commented 3 years ago

Hello,

I'm unclear on the pretraining procedures, in particular the preprocessing of the datasets.

Unless I'm mistaken, https://github.com/mlpen/Nystromformer/blob/main/data-preprocessing/preprocess_data_512.py suggests the data is just split into segments of size 512. I'm not sure I understand how SOP is defined in this case? Actually, SOP doesn't seem to be used at all in the pretrain script.

yyxiongzju commented 3 years ago

@thomasw21, the preprocessing segments the corpus into sequences with a fixed number of tokens. The preprocessed datasets have been put in the docker, so you do not have to redo it on your own. @mlpen did not add the sentence order prediction (SOP) part in preprocess_data_512.py. He will include it when he is available.
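For intuition, the fixed-length segmentation described above can be sketched roughly as follows. This is an illustrative sketch, not the repo's actual preprocess_data_512.py; the function name and the drop-the-remainder choice are assumptions for illustration.

```python
def segment_tokens(token_ids, block_size=512):
    """Split a flat list of token ids into non-overlapping fixed-size blocks.

    Trailing tokens that do not fill a complete block are dropped, so every
    output sequence has exactly `block_size` tokens.
    """
    n_blocks = len(token_ids) // block_size
    return [
        token_ids[i * block_size:(i + 1) * block_size]
        for i in range(n_blocks)
    ]
```

With `block_size=512`, a tokenized corpus of N tokens yields N // 512 training sequences, each exactly 512 tokens long.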

mlpen commented 3 years ago

We reorganized the code implementation for experiments. The data processing code for BERT (MLM and SOP) is on https://github.com/mlpen/Nystromformer/blob/main/reorganized_code/BERT/dataset.py
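For readers wondering what the MLM + SOP objectives look like at the data level, here is a hedged sketch; the real logic lives in the linked dataset.py, and the function names, the 50% swap rate, and the `-100` ignore-label convention here are illustrative assumptions in the spirit of ALBERT-style SOP and BERT-style MLM.

```python
import random

def make_sop_pair(segment_a, segment_b, rng=random):
    """SOP example from two consecutive segments: label 1 if kept in
    order, 0 if the segments are swapped (each with 50% probability)."""
    if rng.random() < 0.5:
        return segment_a, segment_b, 1  # in order
    return segment_b, segment_a, 0      # swapped

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=random):
    """MLM example: replace each token with mask_id with probability
    mask_prob; labels hold the original id at masked positions and
    -100 (ignored by the loss) elsewhere."""
    masked, labels = [], []
    for t in token_ids:
        if rng.random() < mask_prob:
            masked.append(mask_id)
            labels.append(t)
        else:
            masked.append(t)
            labels.append(-100)
    return masked, labels
```

The pretraining loss then combines the masked-token prediction loss with a binary classification loss on the SOP label.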

thomasw21 commented 3 years ago

Thank you! I'll take a look when I get the chance. I'll close this issue and open a new one if I have more questions.