google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0

build_pretraining_dataset.py #80

Closed by shinhyeokoh 3 years ago

shinhyeokoh commented 4 years ago

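Because write_examples calls line.strip() before its check, a blank line becomes an empty string and, unless _blanks_separate_docs is set, is never passed to add_line. As a result, the 'if not line' condition in add_line never fires for blank lines. I think strip() should be removed from write_examples.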


def write_examples(self, input_file):
    """Writes out examples from the provided input file."""
    with tf.io.gfile.GFile(input_file) as f:
        for line in f:
            line = line.strip()
            if line or self._blanks_separate_docs:
                example = self._example_builder.add_line(line)

def add_line(self, line):
    """Adds a line of text to the current example being built."""
    line = line.strip().replace("\n", " ")
    if (not line) and self._current_length != 0:  # empty lines separate docs
        return self._create_example()

shinhyeokoh commented 3 years ago

Looking at this again later, I think they did it on purpose.
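For anyone landing here later, the interaction between the two strip() calls can be reproduced with a toy model. This is only a sketch under assumptions, not the repository's code: ToyBuilder, the strip_in_writer switch, and the final flush are hypothetical, and the builder stores raw lines instead of tokenizing or tracking length. It suggests that the strip() in write_examples is what lets the blanks_separate_docs flag decide whether a blank line ever reaches add_line.

```python
class ToyBuilder:
    """Hypothetical stand-in for the example builder (no tokenization)."""

    def __init__(self):
        self._current_lines = []

    def add_line(self, line):
        line = line.strip().replace("\n", " ")
        if not line and self._current_lines:  # empty line closes the current doc
            return self._create_example()
        if line:
            self._current_lines.append(line)
        return None

    def _create_example(self):
        example, self._current_lines = self._current_lines, []
        return example


def write_examples(lines, blanks_separate_docs, strip_in_writer=True):
    """Toy version of the loop; strip_in_writer is an illustrative switch."""
    builder = ToyBuilder()
    examples = []
    for line in lines:
        if strip_in_writer:
            line = line.strip()
        # With the strip, a blank line is "" (falsy) and only the flag lets it through.
        # Without the strip, a blank line is "\n" (truthy) and always gets through.
        if line or blanks_separate_docs:
            example = builder.add_line(line)
            if example:
                examples.append(example)
    last = builder.add_line("")  # assumed flush of whatever is still buffered
    if last:
        examples.append(last)
    return examples


lines = ["doc one, line a\n", "doc one, line b\n", "\n", "doc two, line a\n"]

print(write_examples(lines, blanks_separate_docs=False))
print(write_examples(lines, blanks_separate_docs=True))
print(write_examples(lines, blanks_separate_docs=False, strip_in_writer=False))
```

In this toy, the first call yields one example containing all three lines, the second yields two examples split at the blank line, and the third (strip removed, flag off) also yields two examples. In other words, removing strip() from write_examples would make blank lines split documents regardless of the flag, which fits the reading that the strip was intentional.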