google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

finetune preprocessing adding padding to the dataset error #137

Open joelsprunger opened 1 year ago

joelsprunger commented 1 year ago

I downloaded the pretrained small model and was trying to fine-tune it for question answering using the `squad` task.

Here is where I run into trouble:

```python
# add padding so the dataset is a multiple of batch_size
while n_examples % batch_size != 0:
  writer.write(self._make_tf_example(task_id=len(self._config.task_names))
               .SerializeToString())
```

The `_make_tf_example()` call above throws the following error:

```
Traceback (most recent call last):
  File "/Users/joelsprunger/Documents/electra/finetune/preprocessing.py", line 111, in serialize_examples
    writer.write(self._make_tf_example(task_id=len(self._config.task_names))
  File "/Users/joelsprunger/Documents/electra/finetune/preprocessing.py", line 141, in _make_tf_example
    value=list(values)))
TypeError: array(0) has type numpy.ndarray, but expected one of: int
```

When I debug inside this call, it looks like the spec with `_feature_spec.name == 'squad_eid'` returns `array(0)` rather than a list of zeros on the following line:

```python
values = spec.get_default_values()
```

Not sure if this is a bug or if I have done something wrong.
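
A possible workaround (just a sketch, assuming the 0-d array returned for scalar specs is the culprit; coercing with `np.atleast_1d` is my guess, not a confirmed upstream fix) would be to convert the default values to plain Python ints before they reach the protobuf constructor:

```python
import numpy as np
import tensorflow as tf

# Simulate what spec.get_default_values() appears to return for a
# scalar feature such as squad_eid: a 0-d int64 array.
values = np.zeros([], dtype=np.int64)  # -> array(0)

# np.atleast_1d turns the 0-d array into a 1-element array, and
# .tolist() converts its elements to native Python ints, which
# tf.train.Int64List accepts.
safe_values = np.atleast_1d(values).tolist()  # -> [0]

feature = tf.train.Feature(int64_list=tf.train.Int64List(value=safe_values))
print(feature)
```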

joelsprunger commented 1 year ago

I tried setting `debug=true` and got the same results with a pared-down train-debug.json. Perhaps it is an issue with Python package versioning; I created the requirements.txt below and only pinned the TensorFlow version, so a newer version of numpy or another package could be causing this bug.

```
tensorflow==1.15
numpy
scikit-learn
scipy
```
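
To check whether the installed package combination is responsible, a minimal repro might look like this (a sketch; the hypothesis that newer protobuf/numpy versions reject numpy scalar values where older ones coerced them silently is an assumption):

```python
import numpy as np
import tensorflow as tf

# What get_default_values() appears to return for a scalar feature spec.
default = np.zeros([], dtype=np.int64)  # array(0), a 0-d ndarray

try:
    # Mirrors the failing call in _make_tf_example: a 0-d array ends up
    # in the value list handed to the protobuf Int64List constructor.
    tf.train.Int64List(value=[default])
    print("no error: this environment coerces numpy scalars to int")
except TypeError as e:
    print("reproduced:", e)
```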