logstash-plugins / logstash-input-s3

Plugin doesn't process objects correctly, doesn't delete or backup #240

Open dabelousov opened 2 years ago

dabelousov commented 2 years ago
  1. Logstash-oss version 7.16-8.1
  2. Docker
  3. K8s - Openshift 4.7
  4. Included in image

Hello. I'm having trouble with the S3 input plugin and a private S3-compatible store (MinIO). Logstash reads the objects and sends them to the output normally, but backup and delete are not working: objects stay in the source bucket unchanged. The objects are small JSON access log files, 1-2 kB on average.

Input config:
    input {
      s3 {
        access_key_id => "${S3_ACCESS_KEY}"
        secret_access_key => "${S3_SECRET_KEY}"
        endpoint => {{ $.Values.s3_connect_endpoint | quote }}
        bucket => "test-bucket"
        prefix => "prefix"
        backup_to_bucket => "backup-bucket"
        backup_add_prefix => "processed"
        delete => true
      }
    }

The IAM role is allowed all actions; I verified that by deleting an object with the mcli tool. In the S3 access logs I see only successful (200) GET and HEAD requests, and not a single PUT, POST or DELETE. In the Logstash log I see only success messages like the ones below.

{"level":"INFO","loggerName":"logstash.inputs.s3","timeMillis":1646814827669,"thread":"[main]<s3","logEvent":{"message":"epaas-caasv3-backups/2022-03-05-09-20-02-312 is updated at 2022-03-05 06:20:02 +0000 and will process in the next cycle"}}

{"level":"INFO","loggerName":"logstash.inputs.s3","timeMillis":1646814827800,"thread":"[main]<s3","logEvent":{"message":"epaas-caasv3-backups/2022-03-05-09-20-02-396 is updated at 2022-03-05 06:20:02 +0000 and will process in the next cycle"}}

{"level":"INFO","loggerName":"logstash.inputs.s3","timeMillis":1646814827932,"thread":"[main]<s3","logEvent":{"message":"epaas-caasv3-backups/2022-03-05-09-20-03-185 is updated at 2022-03-05 06:20:03 +0000 and will process in the next cycle"}}

Found some interesting code https://github.com/logstash-plugins/logstash-input-s3/blob/main/lib/logstash/inputs/s3.rb#L383

As I understand it, the plugin compares the object's last_modified with the value recorded for that log entry and, judging by my log, postpones processing the object to the next cycle; after the default 60 seconds the same thing repeats.
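For reference, the decision around that line looks roughly like the sketch below (a paraphrased sketch of the linked source, not the exact plugin code): the object is only backed up, deleted and recorded in the sincedb when its last_modified still equals the timestamp captured during the list operation.

    # Paraphrased sketch of the end of object processing (not the exact plugin code).
    # `log` is the listing entry captured earlier, `object` is the same key re-fetched now.
    def finish_processing(log, object)
      if object.last_modified == log.last_modified
        # Timestamps match: the object was not touched while being processed,
        # so it is safe to back it up, delete it and remember it in the sincedb.
        backup_to_bucket(object)
        delete_file_from_bucket(object)
        sincedb.write(log.last_modified)
      else
        # Timestamps differ: postpone, which produces exactly the log message above.
        @logger.info("#{log.key} is updated at #{object.last_modified} and will process in the next cycle")
      end
    end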

I also tried setting sincedb_path => "/tmp/logstash/since.db", but the file is never created. Objects from the bucket are downloaded to /tmp/logstash/ and stay there.

dabelousov commented 2 years ago

I fixed the plugin in a fork: https://discuss.elastic.co/t/s3-input-dont-delete-backup-files-after-processing/299187

pebosi commented 2 years ago

Same problem here, using the logstash 8.2.0 docker image. Switched to the fork...

Derekt2 commented 1 year ago

Same here, switched to the fork.

lysenkojito commented 1 year ago

@kaisecheng any ideas why it happens?

kaisecheng commented 1 year ago

The reason for comparing the last modified time of the object and the log entry is to confirm the object has not been updated since the list action. If the object gets updated, its last modified time will push it to the next cycle. Deleting the comparison leads to duplication/reprocessing of ingested data.

> I also tried setting sincedb_path => "/tmp/logstash/since.db", but the file is never created.

The plugin can't work properly without the sincedb. Maybe the Logstash user lacks permission to write to the path? Enabling debug logging should give some hints.

lysenkojito commented 1 year ago

> The reason for comparing the last modified time of the object and the log entry is to confirm the object has not been updated since the list action. If the object gets updated, its last modified time will push it to the next cycle. Deleting the comparison leads to duplication/reprocessing of ingested data.
>
> > I also tried setting sincedb_path => "/tmp/logstash/since.db", but the file is never created.
>
> The plugin can't work properly without the sincedb. Maybe the Logstash user lacks permission to write to the path? Enabling debug logging should give some hints.

We use a MinIO S3 bucket with admin s3:* permissions. Logstash reads the logs fine, but keeps re-reading them all the time.

kaisecheng commented 1 year ago

> but keeps re-reading them all the time

It sounds like the plugin has an issue updating the sincedb. To compare object timestamps, Logstash needs to write the last modified time to the sincedb; otherwise the objects are reprocessed in the next cycle. Please check if Logstash is able to write to sincedb_path and whether the file (sincedb) is updated successfully.
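For what it's worth, the sincedb is just a tiny one-line file holding the last processed timestamp; it behaves roughly like the sketch below (a paraphrased sketch, not necessarily the exact current implementation):

    require "time"

    # Paraphrased sketch of the plugin's sincedb handling.
    class SinceDB
      def initialize(path)
        @path = path
      end

      # Returns the last recorded timestamp, or the epoch start if nothing was written yet.
      def read
        return Time.new(0) unless File.exist?(@path)
        content = File.read(@path).chomp.strip
        content.empty? ? Time.new(0) : Time.parse(content)
      end

      # Called after an object has been fully processed, backed up and deleted.
      def write(time = Time.now)
        File.write(@path, time.to_s)
      end
    end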

lysenkojito commented 1 year ago

> > but keeps re-reading them all the time
>
> It sounds like the plugin has an issue updating the sincedb. To compare object timestamps, Logstash needs to write the last modified time to the sincedb; otherwise the objects are reprocessed in the next cycle. Please check if Logstash is able to write to sincedb_path and whether the file (sincedb) is updated successfully.

Should I write something to sincedb_path myself? And how can I check whether Logstash is able to write to sincedb_path?

lysenkojito commented 1 year ago

> > but keeps re-reading them all the time
>
> It sounds like the plugin has an issue updating the sincedb. To compare object timestamps, Logstash needs to write the last modified time to the sincedb; otherwise the objects are reprocessed in the next cycle. Please check if Logstash is able to write to sincedb_path and whether the file (sincedb) is updated successfully.

I tried running two pipelines simultaneously: one using an AWS S3 bucket, the other a MinIO S3 bucket. In both cases I found no errors in debug mode.

The debug output said that both pipelines had their default sincedb file created, BUT only one actually existed at the mentioned path - the one for the AWS bucket.

It's not local filesystem permissions and not MinIO permissions (we use admin credentials). There is too little logging to understand why it happened.

Please advise how to debug and fix this.

@kaisecheng

kaisecheng commented 1 year ago

@lysenkojito The permission I refer to is that the user running Logstash should have enough privilege to write to disk at sincedb_path. Taking the docker environment as an example, the default user is logstash.

  1. Make sure the logstash user can read and write the path sincedb_path (a quick check is sketched right after this list)
  2. Make sure each s3 input has a unique sincedb_path (this setting must be a file path, not just a directory)
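One hypothetical way to verify point 1 from inside the container is a one-off Ruby check like the one below, run as the same user that runs Logstash (the path is just an example; substitute your sincedb_path):

    # Hypothetical standalone check: can this user create and update a file at sincedb_path?
    path = "/tmp/logstash/since.db"   # example path, replace with your sincedb_path
    begin
      File.open(path, "a") { |f| f.write("") }   # create or touch the file without changing its content
      puts "OK: #{path} is writable"
    rescue SystemCallError => e
      puts "FAILED: cannot write #{path}: #{e.message}"
    end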

> BUT only one actually existed at the mentioned path - the one for the AWS bucket.

Are you setting the same sincedb path in both pipelines? If the paths are unique, I would expect to see an error in the log for the MinIO S3 input. If you believe this is a bug, the best path forward is to create a new issue including a reproducer with the debug log, config and pipelines for further investigation. We officially support AWS S3; help with MinIO will be limited.
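For context, when sincedb_path is not set the plugin derives a per-input file name from the bucket and prefix, roughly like the sketch below (paraphrased; the exact code may differ between versions). This is why two pipelines end up with different file names inside the same .../plugins/inputs/s3/ folder:

    require "digest"
    require "fileutils"

    # Paraphrased sketch of how the default sincedb file name is derived.
    def default_sincedb_file(data_path, bucket, prefix)
      digest = Digest::MD5.hexdigest("#{bucket}+#{prefix}")
      dir = File.join(data_path, "plugins", "inputs", "s3")
      FileUtils.mkdir_p(dir)
      File.join(dir, "sincedb_#{digest}")
    end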

lysenkojito commented 1 year ago

> @lysenkojito The permission I refer to is that the user running Logstash should have enough privilege to write to disk at sincedb_path. Taking the docker environment as an example, the default user is logstash.
>
> 1. Make sure the logstash user can read and write the path sincedb_path
> 2. Make sure each s3 input has a unique sincedb_path (this setting must be a file path, not just a directory)
>
> > BUT only one actually existed at the mentioned path - the one for the AWS bucket.
>
> Are you setting the same sincedb path in both pipelines? If the paths are unique, I would expect to see an error in the log for the MinIO S3 input. If you believe this is a bug, the best path forward is to create a new issue including a reproducer with the debug log, config and pipelines for further investigation. We officially support AWS S3; help with MinIO will be limited.

@kaisecheng The sincedb paths were left at their defaults. They had different names, but the same folder: …/s3/

It's definitely not a permissions issue. Okay, I'll create an issue. Thank you

volter commented 3 weeks ago

In my setup with MinIO, I found the problem to be that the compared timestamps are not exactly equal: one of them is 12345678.0, the other 12345678.863. I don't fully understand where these timestamps come from, so I don't know whether the precision matters.

This is a systematic mismatch between these two pieces of information, so the code never takes the branch that creates a sincedb entry or deletes objects. As a consequence Logstash loops madly, consuming a lot of CPU time.
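To illustrate the failure mode (a hypothetical reproduction, not the plugin's actual code): if one side of the comparison carries sub-second precision while the other has been truncated to whole seconds, an exact equality check can never succeed, so the "will process in the next cycle" branch is taken on every run.

    require "time"

    # Hypothetical illustration of the mismatch described above.
    listed  = Time.parse("2022-03-05 06:20:02.863 +0000")  # e.g. timestamp with sub-second precision
    fetched = Time.parse("2022-03-05 06:20:02 +0000")      # e.g. the same instant truncated to whole seconds

    puts listed == fetched   # => false, so backup/delete/sincedb update is skipped
    puts listed.to_f         # => 1646461202.863
    puts fetched.to_f        # => 1646461202.0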