googleapis / nodejs-logging-winston

Node.js client integration between Stackdriver Logging and Winston.
https://cloud.google.com/logging/
Apache License 2.0
105 stars 50 forks source link

Debug messages start after specific google cloud error #751

Closed dhdnicodemus closed 1 year ago

dhdnicodemus commented 1 year ago

Hi,

I'm seeing a peculiar behavior. We're using :

"@google-cloud/logging-winston": "^4.1.1"

Periodically, every couple of days or s, we see this error in our application's logs: Error: Total timeout of API google.logging.v2.LoggingServiceV2 exceeded 60000 milliseconds before any response was received

And from that point forward, it's as if the applications log level has changed to 'debug' and we see debug messages in our logs after this and it continues until the application's pods are restarted.

Any ideas on what is going on here?

Thanks, Dave

losalex commented 1 year ago

@dhdnicodemus , thanks a lot for opening an issue! Is there by any chance a way you could share with me the debug messages you getting? Are those debug messages come from your code or it seems comes from a nodejs-logging-winston or one of its dependencies?

BossaGroove commented 1 year ago

exactly same issue, once I'm getting Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received. from the pubsub module, @google-cloud/logging-winston is start ignoring my log level configuration and streaming all debug logs to GCP, and becuase of this bug it's already costing us hundred of dollars

losalex commented 1 year ago

@BossaGroove, one of the solutions is to handle the error through default callback - you can see how it can be added here. Make sure that you update to latest nodejs-logging-winston version since there were related bugs fixed recently. Please let me know if this helps.

losalex commented 1 year ago

I am going to close this issue since there was no answers so far - I believe that using default-callback supposed to help, please let me know if thats not a case

BossaGroove commented 1 year ago

@losalex I cannot replicate this issue at the moment because it relies on other GCP's node module to have a timeout request (say bigquery, pubsub, etc) will update if the default-callback trick does not work

dhdnicodemus commented 1 year ago

Thanks for your attention and feedback, sorry i didn't' get back to, things got busy on this end. Will try your recommendations.

fsikansi commented 1 year ago

I have the same issue using the default callback after update to version 5.2.3. Before that my application pod was restarting every time we got an Total timeout of API google.logging.v2.LoggingServiceV2 exceeded 60000 milliseconds before any response was received. due to uncaughtexception (using node 18), the default callback solved this issue but now I realize the application started to show debug logs on production just after the stdout have been called by the default callback implementation.

losalex commented 1 year ago

@fsikansi , can you perhaps share what kind of debug logs you are getting? Are those debug messages come from your code or it seems comes from a nodejs-logging-winston or one of its dependencies?

fsikansi commented 1 year ago

@losalex they are all from my application, it seems like the winston log level configuration changes from info to debug after the default callback been called.

losalex commented 1 year ago

I see - thanks @fsikansi for more context. Looking into code, I do not see if we manipulate/downgrade level upon exception - we do set level in constructor of TransportStream only, see here. It seems might be related to TransportStream behavior, will need to investigate it further.

dhdnicodemus commented 1 year ago

I saw the exact same as @losalex is describing.

losalex commented 1 year ago

I tried to repro a problem an apparently I was not able to repro it when I set the desired level parameter in LoggingWinston constructor. @dhdnicodemus and @fsikansi, would you please try to set the level in LoggingWinston constructor and retry? Below is a code I used to run my testing:

const winston = require('winston');
const { LoggingWinston } = require('@google-cloud/logging-winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
    new LoggingWinston({
      level: "error",
      projectId: "your-project-id",
      keyFilename: 'your-key-file.json',
      maxRetries: 1,
      defaultCallback: (error: any, apiResponse: any) => {
        console.debug('defaultCallback called', {
          error,
          apiResponse,
        });
      },
    })
  ]
});

logger.error('Failing request to simulate timeout', {
  httpRequest: {
    requestMethod: 'GET',
    requestUrl: 'http://some-url.com',
    status: 200,
    latency: "1s", // Latency in wrong format on purpose so that send fails
  }
});

setTimeout(() => {
  logger.info("This should never be logged by LoggingWinston since transport level is error");
}, 
10000);
fsikansi commented 1 year ago

@losalex run similar tests here and it seems that moving the level config from createLogger to LoggingWinston constructor fixes the wrong behavior. Thanks

losalex commented 1 year ago

Thank you so much @fsikansi for your time and confirmation! I will close this issue for now - feel free to comment or reopen per need