neptune-ai / neptune-client

📘 The experiment tracker for foundation model training
https://neptune.ai
Apache License 2.0

BUG: Error msg during training - Timestamp must be non-decreasing for series attribute #733

Open stonelazy opened 3 years ago

stonelazy commented 3 years ago

Describe the bug

When running the Neptune logger in PyTorch Lightning with DDP on more than 1 GPU, there are continuous errors reading "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute". If the Neptune logger is in offline mode, or if the Neptune logger is removed, this error isn't logged. There are so many errors that even the training progress bar is difficult to make out.

Reproduction

When I ran with 4 GPUs, I was able to reproduce this: https://colab.research.google.com/drive/1TOadmpet63eSXz6LMHVvdM-D6Gy0LDxe?usp=sharing

Expected behavior

If this is a valid error message, there is no hint of what action needs to be taken. If the messages are harmless/not valid, kindly suggest a way to suppress them.

Traceback

Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z
Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/stdout. Invalid point: 2021-10-15T13:25:02.767Z

Environment

The output of pip list:

PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.11 (default, Aug 3 2021, 15:09:35) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-37-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: RTX A6000
GPU 1: RTX A6000
GPU 2: RTX A6000
GPU 3: RTX A6000
GPU 4: RTX A6000
GPU 5: RTX A6000
GPU 6: RTX A6000
GPU 7: RTX A6000

Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.910
[pip3] mypy-extensions==0.4.3
[pip3] neptune-pytorch-lightning==0.9.7
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.9
[pip3] torch==1.9.0+cu111
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.4.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] mypy 0.910 pypi_0 pypi
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] neptune-client 0.12.0 pypi_0 pypi
[conda] neptune-contrib 0.27.3 pypi_0 pypi
[conda] neptune-pytorch-lightning 0.9.7 pypi_0 pypi
[conda] numpy 1.21.1 pypi_0 pypi
[conda] numpy-base 1.21.2 py38h79a1101_0
[conda] pytorch-lightning 1.4.9 pypi_0 pypi
[conda] torch 1.9.0+cu111 pypi_0 pypi
[conda] torch-poly-lr-decay 0.0.1 pypi_0 pypi
[conda] torchaudio 0.9.0 pypi_0 pypi
[conda] torchmetrics 0.4.1 pypi_0 pypi

Additional context

(image attached)
kamil-kaczmarek commented 3 years ago

Hey @stonelazy,

Thanks for this report.

You have more than one process writing asynchronously to the same field in Neptune (monitoring/gpu_3, for example). For a given field, Neptune expects the step to be strictly increasing (x, x+1, x+1.0001, x+4, x+5, ...). When that's not the case, you see the error you reported here. As a consequence, the offending point is dropped and your run progresses uninterrupted.

In common use like run["x"].log(2), Neptune assigns consecutive integer steps starting at 0. This is what you observe on the x-axis of your loss charts.

So what is happening in your case is the following (all processes log to the same run, async):

process A: run["x"].log(value=7, step=1) (ok)
process B: run["x"].log(value=3, step=2) (ok)
process A: run["x"].log(value=1, step=2) (error)
process A: run["x"].log(value=4, step=3) (ok)
process B: run["x"].log(value=0, step=3) (error)
process A: run["x"].log(value=9, step=4) (ok)
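
A rough sketch of that interleaving (hedged: the project name is a placeholder, you need valid credentials to actually run it, and two run handles stand in for two DDP processes):

import neptune.new as neptune

# "process A" creates the run; "process B" attaches to the same run by its id
run_a = neptune.init(project="my-workspace/my-project")
run_b = neptune.init(project="my-workspace/my-project", run=run_a["sys/id"].fetch())

run_a["x"].log(value=7, step=1)  # ok
run_b["x"].log(value=3, step=2)  # ok
run_a["x"].log(value=1, step=2)  # repeated step -> the point is dropped and the async error is logged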

It is not an error that will crash your run; at the same time, I see that it can be annoying.

A quick workaround is to disable stdout/stderr capture and hardware monitoring, as that is where I see the error msgs coming from (docs).

neptune.init(
    capture_stdout=False,
    capture_stderr=False,
    capture_hardware_metrics=False,
    ...,
)
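
If the run is created by PyTorch Lightning's NeptuneLogger rather than by calling neptune.init yourself, the neptune.new-based logger forwards extra keyword arguments to neptune.init, so the same flags should pass through (sketch only; the project name is a placeholder and the exact signature depends on your logger/PTL version):

from pytorch_lightning.loggers import NeptuneLogger

neptune_logger = NeptuneLogger(
    project="my-workspace/my-project",
    capture_stdout=False,
    capture_stderr=False,
    capture_hardware_metrics=False,
)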

You will not get these error msgs, at the expense of a less granular view into your runs.

At Neptune we see this use case, and we are working on better handling of such needs. I wanted to ask if you would be open to a short online meeting with me. I'm preparing some best practices for our users and, at the same time, passing insights to the product team. I think your insights/needs would help us a lot in getting it right! https://calendly.com/kamil-neptune/30min

stonelazy commented 3 years ago

Thanks for the details. Yeah, I had figured that out: with those 3 flags disabled I didn't get the error, and I have been running my training like that for a few days now. But the workaround comes at the cost of data loss; we would no longer be able to monitor CPU/GPU metrics. This workaround doesn't seem convincing, and the continuous error messages are totally annoying. I guess this issue feels like a showstopper for me.

kamil-kaczmarek commented 3 years ago

@stonelazy,

I appreciate your feedback. I will get back here with more details soon.

kamil-kaczmarek commented 3 years ago

@stonelazy,

For now I have some temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

Not sure if it will work for the GPU metrics as well, but it's worth trying. At the same time, please use it wisely and as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think?

kamil-kaczmarek commented 3 years ago

Hi @stonelazy,

I'm wondering what your current status / thinking is when it comes to the above issue?

stonelazy commented 3 years ago

Hi @kamil-kaczmarek, I'm not having any issues now with the suggested workaround.

kamil-kaczmarek commented 3 years ago

@stonelazy ok, good.

I recommend transitioning to PTL 1.5 and using our new logger.

stonelazy commented 3 years ago

If we move to PTL 1.5, we won't have this error?

kamil-kaczmarek commented 3 years ago

@stonelazy unfortunately not, I'm sorry.

I referred to the new logger because of improvements in other areas; in particular, it uses our new API under the hood.

qchenevier commented 2 years ago

@stonelazy,

For now I have some temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

For people stumbling on this issue: you can also "cleanly" silence Neptune logs in kedro using the logging.yml configuration file. To do so, add these lines to the loggers section of conf/base/logging.yml:


loggers:
  # [ OTHER LOGGERS CONFIGURATION ...]

  neptune.new.internal.operation_processors.async_operation_processor:
    level: CRITICAL
    handlers: [console, info_file_handler, error_file_handler]
    propagate: no

Blaizzy commented 2 years ago

Hi @stonelazy,

Prince Canuma here, a Data Scientist at Neptune.ai,

I want to personally inform you of the good news! This issue is now fixed in the latest release of PyTorch-Lightning, v1.5.7 🎊 🥳

All you need to do is upgrade the library to the latest release👍

Happy Christmas and a prosperous New Year in advance!

stevenmanton commented 1 year ago

I'm still seeing this error with pytorch-lightning==1.9.1 and neptune-client==0.16.17.

Blaizzy commented 1 year ago

Hi @stevenmanton

Could you share a minimal reproducible example and the setup you have (i.e., number of GPUs and strategy)?

mrembalski commented 1 year ago

The issue is still present. Just logging the hardware metrics with more than 1 process results in a lot of spam. I don't know why this issue is closed. Please reopen it.

Blaizzy commented 1 year ago

Hi @mrembalski

Bummer that you're experiencing some issues. No worries, we're here to help!

I have a couple of questions that'll help me understand the situation better:

Once I have this info, we'll dive right in and get things sorted out for you!

By the way, if you are using PyTorch we have since released a guide on our docs on how to track DDP jobs. Read more here: https://docs.neptune.ai/tutorials/running_distributed_training/

mrembalski commented 1 year ago

I'm using Pytorch Lightning. I'll try to create a snippet tomorrow.

Blaizzy commented 1 year ago

Perfect!

I am looking forward to it :)

harishankar-gopalan commented 1 year ago

Hi @Blaizzy, I am trying to use the Neptune logger in pure PyTorch through Fairseq, i.e. trying to integrate a Neptune logger for the Fairseq package, which already supports lots of your competitors like WANDB, AzureML, Aim, etc.

I am also facing the same issue. I have run the job on only one GPU rank and am still getting the logs stated above (reproduced below).

2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.659Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.985Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:46.987Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.370Z
2023-04-16 18:04:48 | ERROR | neptune.internal.operation_processors.async_operation_processor | Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/bb14c23a/stdout. Invalid point: 2023-04-16T12:34:47.371Z

Where exactly is this an issue? As in, is there some way I can make changes to prevent this issue from occurring, some kind of sync/wait primitives available in Neptune that I am not calling? Any direction would help me better handle the logger implementation for Fairseq.

kbruegge commented 1 year ago

@stonelazy,

For now I have some temporary workaround. You can try to suppress these errors like this:

import logging
logging.getLogger("neptune.new.internal.operation_processors.async_operation_processor").setLevel(logging.CRITICAL)

Not sure if it will work for the GPU metrics as well, but it's worth trying. At the same time, please use it wisely and as a temporary workaround, as it disables all errors from async processing.

I already passed info about your case to the product team for further consideration.

Let me know what you think?

Hey, the import changed (the "new" is gone from the path). Here is how I ignore these messages for now:

logging.getLogger("neptune.internal.operation_processors.async_operation_processor").setLevel(
    logging.CRITICAL
)
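
If you need a variant that covers both client versions, it should be harmless to set the level for both logger names, since the path containing "new" simply won't be used on newer releases (a sketch, not tested against every version):

import logging

for name in (
    "neptune.new.internal.operation_processors.async_operation_processor",  # neptune-client < 1.0
    "neptune.internal.operation_processors.async_operation_processor",      # neptune >= 1.0
):
    logging.getLogger(name).setLevel(logging.CRITICAL)
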
danieldanciu commented 2 months ago

This issue is still very much not resolved, despite having been marked as resolved. Our logs are absolutely illegible because this message appears millions and millions of times:

(screenshot attached: Screen Shot 2024-09-17 at 11 21 05 AM)

Note that this is training on a single machine, so this cannot be caused by multiple loggers with slightly out-of-sync clocks. As far as I can tell, Neptune is actually trying to log messages with identical timestamps as keys, and this is causing the error. The following three messages all have the same timestamp to millisecond precision. Unless Neptune uses microsecond precision for the logs, I think this must be the issue:

2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
2024/09/17 11:29:29 | [neptune] [error ] Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute: monitoring/resources/stdout. Invalid point: 2024-09-17T09:29:24.366Z
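
A standalone illustration of that suspicion (this is not Neptune's timestamping code, just a sketch): two writes issued back to back usually land in the same millisecond, so once the timestamps are truncated to millisecond precision they come out identical, like the repeated "Invalid point" values above.

from datetime import datetime, timezone

def to_millis(t):
    # format like the reported points, e.g. 2024-09-17T09:29:24.366Z
    return t.strftime("%Y-%m-%dT%H:%M:%S.") + f"{t.microsecond // 1000:03d}Z"

t1 = datetime.now(timezone.utc)
t2 = datetime.now(timezone.utc)
print(to_millis(t1), to_millis(t2))  # typically prints the same value twice
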
danieldanciu commented 2 months ago

On a different note, a better way to filter out these spurious error messages without affecting actual legitimate error logs is to do something like:

import logging

from typing_extensions import override  # or typing.override on Python 3.12+

import neptune.internal.operation_processors.async_operation_processor


class _FilterCallback(logging.Filterer):
    @override
    def filter(self, record: logging.LogRecord):
        return not (
            record.name == "neptune"
            and record.getMessage().startswith(
                "Error occurred during asynchronous operation processing: Timestamp must be non-decreasing for series attribute"
            )
        )

...
neptune.internal.operation_processors.async_operation_processor.logger.addFilter(_FilterCallback())
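
Since the filter only matches records originating from the "neptune" logger, an alternative (untested here, but it avoids reaching into the internal module) should be to attach the same filter to that logger directly:

import logging

logging.getLogger("neptune").addFilter(_FilterCallback())
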
SiddhantSadangi commented 2 months ago

Hey @danieldanciu 👋 Thanks for pointing this out. While not a bug (more of a design issue), it should not have been marked as resolved.

We are working on an updated client version that will address this issue. Meanwhile, the workaround you've suggested to filter out such logs works like a charm, and we'll add it to our docs ✅