opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

[BUG] Unable to create a new RollUp Job in OpenSearch 2.12 #1161

Closed (karthikeyan21 closed this 3 days ago)

karthikeyan21 commented 2 months ago

What is the bug? RollUp Job creation fails with a 500 error code in OpenSearch 2.12.

Error Message : {"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.time.Instant.plusMillis(long)\" because \"startTime\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.time.Instant.plusMillis(long)\" because \"startTime\" is null"},"status":500}

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Use the API call below to create a new RollUp Job (Create RollUp Job sample):

     curl -X PUT localhost:9200/_plugins/_rollup/jobs/test -H 'Content-Type:application/json' -d '{"rollup":{"target_index":"rollup_hourly_fmstats_test","description":"Hourly Stats Rollup","source_index":"test_*","enabled":true,"schedule":{"interval":{"period":60,"unit":"Minutes"}},"delay":0,"continuous":"true","metrics":[{"source_field":"abc.accepted","metrics":[{"max":{}}]},{"source_field":"abc.rejected","metrics":[{"max":{}}]},{"source_field":"abc.matched","metrics":[{"max":{}}]}],"page_size":5000,"dimensions":[{"date_histogram":{"fixed_interval":"60m","source_field":"timestamp"}},{"terms":{"source_field":"name"}}]}}'

  2. RollUp Job creation fails with a 500 error.

What is the expected behavior? RollUp Job to be created and data to be rolled up

What is your host/environment?

Do you have any screenshots? NA

Do you have any additional context? While debugging the code, I noticed that the Schedule start time is not initialised. Changing the code to use Instant.now() instead of schedule.startTime fixed the issue.

Update: This doesn't affect existing RollUp Jobs. Any job created with an earlier version (2.10) seems to keep working, because its start time is already initialised.

mgodwan commented 2 months ago

Related to https://github.com/opensearch-project/index-management/pull/1040

@bowenlan-amzn @ikibo Could you please check?

bowenlan-amzn commented 2 months ago

Yes, I think this is a miss and it causes a breaking change. Regarding https://github.com/opensearch-project/index-management/pull/1040#discussion_r1401982311 : if the user doesn't pass in start_time, schedule.startTime will be null, which causes the exception when the IntervalSchedule is instantiated.

The solution is to add schedule.startTime ?: Instant.now() back
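
For readers following along, here is a minimal Kotlin sketch of the null fallback described above. It is not the plugin's code: `IntervalRequest`, `resolveStartTime`, and `firstExecution` are hypothetical names made up for illustration, and the `plusMillis` call only mirrors what the error message reports.

```kotlin
import java.time.Instant

// Hypothetical stand-in for a parsed "schedule.interval" request body.
// This is NOT the plugin's actual class; it exists only for illustration.
data class IntervalRequest(
    val period: Long,
    val unitMillis: Long,
    val startTime: Instant?, // null when the request omits start_time
)

// Fall back to the current time when start_time was not supplied,
// i.e. the `schedule.startTime ?: Instant.now()` expression from the thread.
fun resolveStartTime(request: IntervalRequest, now: Instant = Instant.now()): Instant =
    request.startTime ?: now

// The error message shows plusMillis being called on startTime;
// with the fallback in place, the receiver can no longer be null.
fun firstExecution(request: IntervalRequest, now: Instant = Instant.now()): Instant =
    resolveStartTime(request, now).plusMillis(request.period * request.unitMillis)

fun main() {
    // Same shape as the failing request: 60-minute interval, no start_time.
    val request = IntervalRequest(period = 60, unitMillis = 60_000, startTime = null)
    println(firstExecution(request))
}
```

Passing `now` as a parameter is only a convenience for testing the sketch; the expression quoted in the thread calls `Instant.now()` directly.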

sarthakaggarwal97 commented 2 months ago

@bowenlan-amzn so looks like schedule.startTime is not a required field. What do you think, should it be a required field?

ikibo commented 2 months ago

@mgodwan, thank you for this finding.

Good point, @bowenlan-amzn: the case where start_time is not defined in the request should have been handled. But the question is how?

According to the official rollup-api-doc, schedule.interval.start_time is a required field (@sarthakaggarwal97 FYI).

@bowenlan-amzn plz help me understand what would be the best way to handle this issue

@bowenlan-amzn the same issue exists for the Transform job (the fix should be pretty much the same as for the rollup). I think we can handle both under this ticket. Please assign this issue to me.

bowenlan-amzn commented 2 months ago

@ikibo Thanks! The goal here is to not introduce a breaking change. I think the documentation is wrong: start_time is evidently not a required field. As the example provided in this issue shows, given a schedule like this

"schedule": {
    "interval": {
        "period": 60,
        "unit": "Minutes"
    }
},

a rollup could be created before, with the start time defaulting to the current time. So please go with the second path:

handle the null check as you suggest (in this case, I would suggest changing the doc to state that start_time is set to the current time if not set explicitly in the request, making it 'kind of' not mandatory)

Also link the transform change, #1040.
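
Since the exception is attributed above to start_time being absent from the request, one possible stopgap on affected versions is to send start_time explicitly. The Kotlin sketch below only builds such a schedule fragment. The helper name `intervalSchedule` is hypothetical, the assumption that start_time is expressed in epoch milliseconds should be checked against the rollup API documentation, and nobody in this thread has confirmed that this avoids the error.

```kotlin
import java.time.Instant

// Hypothetical helper: renders a "schedule" value that carries an explicit start_time.
// Assumption (not confirmed in this thread): start_time is epoch milliseconds.
fun intervalSchedule(period: Int, unit: String, startTime: Instant = Instant.now()): String =
    """{"interval":{"start_time":${startTime.toEpochMilli()},"period":$period,"unit":"$unit"}}"""

fun main() {
    // Produces a fragment that could replace the "schedule" value in the request body.
    println(intervalSchedule(60, "Minutes"))
}
```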

louzadod commented 3 days ago

The workaround I used was replacing schedule.interval with schedule.cron. But I miss schedule.interval a lot.

bowenlan-amzn commented 3 days ago

@louzadod this has been fixed in 2.14

louzadod commented 3 days ago

Hi @bowenlan-amzn. Right after migrating from 2.11 to 2.14, my rollup jobs configured with schedule.interval stopped running. After I replaced schedule.interval with schedule.cron, they started running again.

bowenlan-amzn commented 3 days ago

@louzadod it's probably not the same issue. Do you want to report a bug with the error you saw and some reproduction steps?

louzadod commented 3 days ago

@bowenlan-amzn I'm getting the same error as reported in this bug and I'm running version 2.14.0.

GET /

{
  "name": "logs-corporativos-client-2",
  "cluster_name": "logs-corporativos",
  "cluster_uuid": "rgFOp61cTRKts3oqa4dAwA",
  "version": {
    "distribution": "opensearch",
    "number": "2.14.0",
    "build_type": "tar",
    "build_hash": "aaa555453f4713d652b52436874e11ba258d8f03",
    "build_date": "2024-05-09T18:51:00.973564994Z",
    "build_snapshot": false,
    "lucene_version": "9.10.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

Here is my rollup definition:

{
    "rollup": {
        "rollup_id": "vulner-history-job",
        "enabled": true,
        "schedule": {
            "interval": {
                "period": 1,
                "unit": "Minutes"
            }
        },
        "enabled_time": null,
        "description": "Rollup job para sumarizar diariamente as vulnerabilidades",
        "schema_version": 16,
        "source_index": "vulnerabilities",
        "target_index": "vulner-history",
        "page_size": 1000,
        "delay": 0,
        "continuous": false,
        "dimensions": [
            {
                "date_histogram": {
                    "fixed_interval": "1d",
                    "source_field": "timestamp",
                    "target_field": "timestamp",
                    "timezone": "America/Sao_Paulo"
                }
            },
            {
                "terms": {
                    "source_field": "severity",
                    "target_field": "severity"
                }
            },
            {
                "terms": {
                    "source_field": "stack_prefix",
                    "target_field": "stack_prefix"
                }
            },
            {
                "terms": {
                    "source_field": "stack",
                    "target_field": "stack"
                }
            },
            {
                "terms": {
                    "source_field": "service",
                    "target_field": "service"
                }
            }
        ],
        "metrics": [
            {
                "source_field": "event_count",
                "metrics": [
                    {
                        "sum": {}
                    }
                ]
            }
        ]
    }
}

After invoking the API for creating the rollup, here is the message:

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.time.Instant.plusMillis(long)\" because \"startTime\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.time.Instant.plusMillis(long)\" because \"startTime\" is null"},"status":500}

bowenlan-amzn commented 2 days ago

@louzadod just did a quick check. 2.14 didn't pick up this fix, it's in 2.15

louzadod commented 2 days ago

ok. thanks for the confirmation, @bowenlan-amzn .