tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

tensorflow.python.framework.errors_impl.InvalidArgumentError: events.out.tfevents. Invalid argument #6398

Open manitadayon opened 1 year ago

manitadayon commented 1 year ago

Hi, I am using the PyTorch Forecasting package, but my code throws an error in the training phase, even during the first epoch. Here is the system configuration:

During the training my code gives the following error:

The traceback:

"/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
python3.10/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
python3.10/site-packages/tensorflow/python/lib/io/file_io.py", line 101, in write
    self._writable_file.append(
tensorflow.python.framework.errors_impl.InvalidArgumentError:events.out.tfevent;  Invalid argument

I do not think there is any problem with the data or the model. However, I see the above traceback every time I train, and I was wondering what causes this issue and how to fix it.
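For context on the failing call: the traceback shows record_writer.py concatenating header + header_crc + data + footer_crc before handing the bytes to the file. The sketch below reproduces that framing under the assumption that it follows the standard TFRecord layout (8-byte little-endian length, then masked CRC-32C checksums). The function names are illustrative and the pure-Python CRC is a stand-in, not the library's own implementation.

```python
import struct


def _make_crc32c_table():
    # CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C of a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, as TFRecord does."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(data: bytes) -> bytes:
    """Build header + header_crc + data + footer_crc for one record."""
    header = struct.pack("<Q", len(data))
    header_crc = struct.pack("<I", masked_crc32c(header))
    footer_crc = struct.pack("<I", masked_crc32c(data))
    return header + header_crc + data + footer_crc


def parse_record(record: bytes) -> bytes:
    """Inverse of frame_record; raises ValueError on a checksum mismatch."""
    (length,) = struct.unpack("<Q", record[:8])
    (header_crc,) = struct.unpack("<I", record[8:12])
    data = record[12:12 + length]
    (footer_crc,) = struct.unpack("<I", record[12 + length:16 + length])
    if header_crc != masked_crc32c(record[:8]) or footer_crc != masked_crc32c(data):
        raise ValueError("corrupt record")
    return data
```

Note that the framing treats the payload as opaque bytes. In the traceback, the exception is raised one frame further down, when file_io.py appends the already-framed bytes to the events file.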

yatbear commented 1 year ago

Hi @manitadayon,

Can you share a snippet of the actual (or example) summary-writing code so that we can troubleshoot further? For example, the tf.summary.scalar parts.

yatbear commented 1 year ago

Adding another note: at first glance (without the actual code to debug), this might have something to do with the sanity of your training data. Please consider double-checking whether there are any negative dimension values (or other out-of-bounds data errors).

manitadayon commented 1 year ago

@yatbear thanks for the quick reply. This is my code:

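# Imports assumed for this snippet (not shown in the original post);
# older installs may need `import pytorch_lightning as pl` instead.
import lightning.pytorch as pl
from pytorch_forecasting import NBeats, TimeSeriesDataSet

max_encoder_length = 10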
max_prediction_length = 10

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    train_data,
    time_idx="Time",
    target="Value",
    group_ids=["Series"],
    time_varying_unknown_reals=["Value"],
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
)

batch_size = 64
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)

trainer = pl.Trainer(accelerator="auto", max_epochs=5)

network = NBeats.from_dataset(
    training,
    learning_rate=1e-5,
    log_interval=10,
    widths=[32, 512],
    backcast_loss_ratio=1.0,
)

trainer.fit(
    network,
    train_dataloaders=train_dataloader,
)

Observation: Training gets stuck at 60%. The actual percentage varies with the batch_size (I do not know why).

The traceback is the same as the one in my first comment.

I am not sure how this can be related to the training data. Do you know if there is any way to get more information on this?
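One low-cost way to gather more information is to test whether the log directory accepts an ordinary write at all, since the exception is raised while appending to the events file. The sketch below is only a diagnostic under that assumption: nothing in this thread confirms the cause, it uses plain Python file I/O rather than TensorFlow's file_io, and probe_log_dir is a hypothetical helper, not part of any library.

```python
import errno
import os
import tempfile


def probe_log_dir(log_dir):
    """Create, write, flush, and delete a small file inside log_dir.

    Returns (True, "ok") on success, or (False, "<ERRNO NAME>: <message>")
    if any step raises OSError.
    """
    try:
        os.makedirs(log_dir, exist_ok=True)
        fd, path = tempfile.mkstemp(prefix="probe.", dir=log_dir)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(b"x" * 4096)
                handle.flush()
                os.fsync(handle.fileno())  # force the bytes to the device
        finally:
            os.remove(path)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc}"
    return True, "ok"


# Example call (the directory name is an assumption; use whatever
# directory your logger actually writes to):
#   print(probe_log_dir("lightning_logs"))
```

If the probe fails with EINVAL or a similar error, the filesystem or path is a more likely suspect than the training data. If it succeeds, that does not rule out a storage problem, but it narrows the search.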

yatbear commented 1 year ago

I'm not sure what kind of data is in train_data (used to construct the TimeSeriesDataSet). Can you examine the data inside?

Also:

train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)

Is num_workers=0 intended?

I see that you are using PyTorch Lightning to orchestrate this. Even though PL uses the TF summary API under the hood to log metrics, I think this error in particular has something to do with how the training data and trainer are set up, or how the metrics are encoded and parsed. Please file an issue under https://github.com/Lightning-AI/lightning/issues or https://github.com/pytorch/pytorch/issues for the PyTorch folks to take a look. Thanks!

manitadayon commented 1 year ago

Sure, I will refer it to them. Yes, num_workers=0 is intended, since anything more than 1 raises some other warnings. About the data, is there any specific issue you are looking for, like missing values or anything like that? The data has no missing values or any other issue that I can think of. It is multiple time series concatenated into one big data frame in long format.

yatbear commented 1 year ago

An example would be some kind of matrix dimension mismatch or a wrong data type. Are there any summary-writing parts in your code that you can share here for debugging on the TF/TB side, such as tf.summary.xxx calls or code using event_file_writer or record_writer of some sort? Otherwise I won't be able to reproduce this error.
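The data problems raised in this thread (missing or non-finite values, wrong data types) can be screened for before training. The sketch below is one way to do it, assuming the long-format layout from the posted code with columns Time, Value, and Series, and assuming rows are available as plain Python dicts (for example via DataFrame.to_dict("records")). The helper name and the particular checks are illustrative choices, not something pytorch-forecasting or TensorBoard prescribes.

```python
import math
from collections import defaultdict


def check_long_format(rows, time_col="Time", target_col="Value", group_col="Series"):
    """Return a list of human-readable problems found in long-format rows.

    Assumes plain Python ints/floats; NumPy integer scalars would need
    converting first (e.g. with int()).
    """
    problems = []
    seen_times = defaultdict(set)
    for i, row in enumerate(rows):
        value = row.get(target_col)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"row {i}: {target_col}={value!r} is not numeric")
        elif not math.isfinite(value):
            problems.append(f"row {i}: {target_col}={value!r} is not finite")

        t = row.get(time_col)
        if isinstance(t, bool) or not isinstance(t, int):
            problems.append(f"row {i}: {time_col}={t!r} is not an integer")
            continue

        series = row.get(group_col)
        if t in seen_times[series]:
            problems.append(f"row {i}: duplicate {time_col}={t} in series {series!r}")
        seen_times[series].add(t)
    return problems
```

An empty result means only that these particular checks passed. It does not show that the data is unrelated to the error.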