logstash-plugins / logstash-input-jdbc

Logstash Plugin for JDBC Inputs
Apache License 2.0

Input crashes on TZInfo::AmbiguousTime Error when setting jdbc_default_timezone #121

Closed proligde closed 4 months ago

proligde commented 8 years ago

Hi there! First of all - thanks a lot for this plugin - it's great!

However, when using my MySQL DB's timestamps in Elasticsearch/Kibana I noticed they were off by one hour, since the MySQL timestamps are interpreted as UTC but are in fact CET.

So I added: jdbc_default_timezone => "Europe/Berlin"

But now my Logstash pipeline crashes fatally, resulting in an endless loop of trying to read data from MySQL with this error:

A plugin had an unrecoverable error. Will restart this plugin.
  Plugin: <LogStash::Inputs::Jdbc jdbc_connection_string=>"jdbc:mysql://localhost:3306/MYDBNAME?zeroDateTimeBehavior=convertToNull", jdbc_user=>"MYDBUSER", jdbc_password=><password>, jdbc_driver_library=>"/driver/mysql-connector-java-5.1.35-bin.jar", jdbc_driver_class=>"com.mysql.jdbc.Driver", jdbc_default_timezone=>"Europe/Berlin", statement_filepath=>"/config/queries/shop_order_item.sql", type=>"shop_order_item", codec=><LogStash::Codecs::Plain charset=>"UTF-8">, jdbc_paging_enabled=>false, jdbc_page_size=>100000, jdbc_validate_connection=>false, jdbc_validation_timeout=>3600, jdbc_pool_timeout=>5, sql_log_level=>"info", parameters=>{"sql_last_value"=>1970-01-01 00:00:00 UTC}, last_run_metadata_path=>"/var/lib/logstash/.logstash_jdbc_last_run", use_column_value=>false, clean_run=>false, record_last_run=>true, lowercase_column_names=>true>
  Error: TZInfo::AmbiguousTime: 2015-10-25T02:49:45+00:00 is an ambiguous local time. {:level=>:error}

I'm aware that this is probably more of a tzinfo-related problem (https://github.com/tzinfo/tzinfo/issues/32), but I don't see any other way to make it work correctly.

Exporting a UNIX timestamp from MySQL and using a date filter with the "unix" timestamp parser is (for whatever reason) painfully slow, resulting in jammed Elasticsearch input queues.

Best - Max

razafinr commented 8 years ago

Hi!

I am having the same error:

Error: TZInfo::AmbiguousTime: 2015-10-25T02:10:03+00:00 is an ambiguous local time

It is quite a problem that this causes Logstash to restart endlessly.

proligde commented 8 years ago

As far as I understand the issue, logstash-input-jdbc would have to let us configure (as tzinfo does) which fallback to use (pick the first or the second point in time) in case of an ambiguous timestamp. As long as it doesn't do this, it is somewhat "correct" to fail fatally.

Here is my understanding of the whole thing - please correct me if I'm wrong:

I know - that's all evil stuff, and everyone should only write UTC timestamps to avoid such awkwardness, but in reality there are a lot of databases out there logging in local timezones that observe daylight saving time shifts - so support for that would be great :-)

razafinr commented 8 years ago

I totally agree with you (timezones are just so annoying to handle).

Perhaps the local_to_utc conversion is global to all Logstash plugins and not specific to the jdbc input? I don't know whether this needs a fix in this plugin or in Logstash itself.

proligde commented 8 years ago

Good point - I don't know either. If one of the contributors of this plugin could confirm or deny that, we could keep this ticket here or move it to the proper Elastic project instead.

I guess in the end there should be a global or plugin-specific setting, leading to these calls:

tz.local_to_utc(d) { |periods| periods.first } or tz.local_to_utc(d) { |periods| periods.last }

as mentioned in the linked tzinfo issue.
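
For illustration, here is a minimal sketch of that block API applied to the timestamp from the original report (assuming the tzinfo gem; the commented results are what the two choices would yield for Europe/Berlin):

require 'tzinfo'

tz    = TZInfo::Timezone.get('Europe/Berlin')
local = Time.utc(2015, 10, 25, 2, 49, 45)  # wall-clock value from the DB; falls in the repeated hour

tz.local_to_utc(local)                              # raises TZInfo::AmbiguousTime
tz.local_to_utc(local) { |periods| periods.first }  # picks CEST (+02:00) => 2015-10-25 00:49:45 UTC
tz.local_to_utc(local) { |periods| periods.last }   # picks CET  (+01:00) => 2015-10-25 01:49:45 UTC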

mbertani commented 8 years ago

I have the same issue here. I am importing data from a database with local timestamps. Even handling the exception and returning a _dateparsefailure in Logstash would help. Then we could redirect the output to a file and parse it accordingly.

razafinr commented 8 years ago

Coming back to this issue, I think that the problem comes from the named_timezones extension used by Sequel.

As you can see here:

https://github.com/jeremyevans/sequel/blob/ac925ce9556f33d56f49b84d905d307c6a621716/lib/sequel/extensions/named_timezones.rb

There is an attribute tzinfo_disambiguator which could be used to remove the ambiguity by choosing which period to use (periods.first or periods.last) in the application.

If I could send a PR I would, but I am really unfamiliar with the Ruby language, so I hope this will help to fix this issue.
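
For reference, a minimal sketch of what setting that attribute looks like with Sequel's named_timezones extension; note (as discussed further down in this thread) that it is global to the whole Ruby process, not per connection:

require 'sequel'

Sequel.extension :named_timezones
Sequel.database_timezone = 'Europe/Berlin'

# Whenever a local time is ambiguous, pick the first matching period
# instead of letting TZInfo raise TZInfo::AmbiguousTime.
Sequel.tzinfo_disambiguator = proc { |datetime, periods| periods.first }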

hameno commented 7 years ago

Any update or workaround for this? My import job is now stuck with this error...

passing commented 7 years ago

We are using the following workaround to convert CET dates without a timezone to UTC in the SQL statement (running against an Oracle database):

select from_tz(cast(<column> as timestamp), 'CET') at time zone ('UTC') "@timestamp"
from <table>
where ...

Then ensure that the timezone setting of the local machine, or "jdbc_default_timezone", is UTC.

Maybe something similar is possible with MySQL?

yaauie commented 6 years ago

Unfortunately, the setting linked for supplying a tzinfo_disambiguator is a global setting for Sequel's named timezone extension, which means we don't have a way of configuring a particular input's handling.

In a brief spike, I have created a timezone proxy object using SimpleDelegator, whose period_for_local has an implicit preference for DST being either enabled or disabled, which seems to work:

# Quacks like a Timezone
class TimezoneWrapper < SimpleDelegator
  ##
  # Wraps a ruby timezone object in an object that has an explicit preference in time conversions
  # either for or against having DST enabled.
  #
  # @param timezone [String,TZInfo::Timezone]
  # @param dst_enabled_on_overlap [Boolean] (default: nil) when encountering an ambiguous time,
  # declare a preference for selecting the option with DST either enabled or disabled.
  def self.wrap(timezone, dst_enabled_on_overlap)
    timezone = TZInfo::Timezone.get(timezone) if timezone.kind_of?(String)
    new(timezone, dst_enabled_on_overlap)
  end

  ##
  # @api private
  def initialize(timezone, dst_enabled_on_overlap)
    super(timezone) # SimpleDelegator
    @dst_enabled_on_overlap = dst_enabled_on_overlap
  end

  ##
  # @override `Timezone#period_for_local`
  # inject an implicit preference for DST being either enabled or disabled if called
  # without an explicit preference
  def period_for_local(value, dst_enabled_on_overlap=nil, &global_disambiguator)
    dst_enabled_on_overlap = @dst_enabled_on_overlap if dst_enabled_on_overlap.nil?
    __getobj__.period_for_local(value, dst_enabled_on_overlap, &global_disambiguator)
  end
end

In this way, we could set Database#timezone to our proxy object, which would resolve ambiguous timestamps as configured:

if @jdbc_default_timezone
  @database.timezone = TimezoneWrapper.wrap(@jdbc_default_timezone, @jdbc_default_timezone_dst_enabled_on_overlap)
end

A couple open questions:

(cc: @guyboertje @jakelandis)

marbin commented 6 years ago

I agree with @mbertani: it would help a lot just to get a proper exception and a _dateparsefailure or something similar. Right now, Logstash silently crashes and drops events. It took me a long time to figure out what was going on.

marbin commented 6 years ago

I don't really understand why this is not labeled as a bug. If I understand correctly, jdbc_default_timezone cannot be used in countries that use daylight saving time.

guyboertje commented 6 years ago

@marbin The bug only surfaces once a year: when jdbc_default_timezone is set, the timezone in use observes daylight saving time, and the input is processing database records whose timestamps were recorded during the change out of daylight saving time - i.e. in the ambiguous hour.

As @yaauie points out, the setting that Sequel (the third-party ORM we use) makes available to us is a global one, meaning that all jdbc inputs and all jdbc_streaming and jdbc_static filters in a Logstash instance (across all pipelines) will share the same disambiguation. Due to this global effect, the setting in one plugin affects all plugins.

Unfortunately, whichever way the disambiguation proc chooses (first or last), it will produce some timestamps that overlap - there is no way to avoid this.

marbin commented 5 years ago

@guyboertje You mean that just because the bug only occurs in timezones with daylight saving time - i.e. all of Europe, the US, Australia, etc. - and just once a year, it is not a bug ;-) The millennium bug was no bug because it only occurred once in a millennium? ;-)

Joking aside, in my case we had several pipelines that just crashed, and it was pretty tricky to figure out what was going on, since there were no error messages in the logs. I had to turn on debugging to find it. Then I had to hack the last_run_metadata to skip the events that occurred during the hour of the daylight saving time change, losing all those events. And all these pipelines were down until we had found and fixed the issue.

I understand this is a hard one to fix, but I think it would make sense to track it as a bug and maybe add a warning to the docs for the jdbc_default_timezone feature, since using it with a timezone that observes daylight saving time WILL crash your pipeline.

guyboertje commented 5 years ago

@marbin There is a bug or three flying about somewhere: the 'enhancement' here, plus the global nature of the Sequel library, plus the scheduler, which also has TZ problems (some other plugins have pinned rufus-scheduler at a very old version, and we need a big round of testing to see what will fail if we remove the pinning). I'm not saying it's not a bug.

We do have plans to switch all our jdbc plugins to a common reusable Java framework based around HikariCP, vanilla JDBC and Quartz. This is a lot of work, but we feel there are many niggling faults in the current code across all jdbc plugins that warrant the effort.

Fixing these issues while keeping the current set of tests passing is the goal of that move to Java.

science695 commented 5 years ago

This "bug" doesn't only happen if you have data with a date in this one-hour interval; it also happens if you have a job that runs during that hour and uses a date field as the index column.

If this happens, an ambiguous date is stored in the meta file, and that job will fail until you manually change or clear that meta file.

marcoacm commented 4 years ago

I am having this issue now. Any changes? Any other workarounds?

spawnrider commented 3 years ago

Same for me. Any update on this ?

bjolivot commented 2 years ago

hello, any news ?

jehutywong commented 2 years ago

Same here. Any workaround example for MySQL 5.5?

bjolivot commented 2 years ago

Hello, we hit this bug again tonight - no progress on the AmbiguousTime problem?

Rick25-dev commented 2 years ago

The problem occurs here as well; no logs have been loaded since the hour change. What's the progress on this issue?

s137 commented 1 year ago

I have encountered the same issue. Any news on when this is going to be resolved? There is already a PR waiting to be merged in the repo of the integration plugin: https://github.com/logstash-plugins/logstash-integration-jdbc/pull/70

flyingman15 commented 1 year ago

As long as the jdbc input crashes on this, for MySQL you can try to avoid the issue like this:
statement => "SELECT CONVERT_TZ(date_created, 'Europe/Paris', 'UTC') AS date_created FROM your_table"
jdbc_default_timezone => "GMT"

(You'll need to populate the timezone tables of your MySQL DB first, if that isn't already done, with: mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql)

CONVERT_TZ should properly handle this annoying AmbiguousTime error.

s137 commented 1 year ago

Any updates on this? Unfortunately it's currently not really feasible for us to convert everything to UTC.

yaauie commented 1 year ago

Since v5.4.0 of this plugin (which is now distributed as a part of the JDBC Integration plugin for Logstash), you can specify the behaviour when encountering an ambiguous timestamp:

While it is common to store local times in SQL’s timestamp column type, many timezones change their offset during the course of a calendar year and therefore cannot be used with SQL’s timestamp type to represent an ordered, continuous timeline. For example in the America/Chicago zone when daylight saving time (DST) ends in the autumn, the clock rolls from 01:59:59 back to 01:00:00, making any timestamp in the 2-hour period between 01:00:00CDT and 02:00:00CST on that day ambiguous.

When encountering an ambiguous timestamp caused by a DST transition, the query will fail unless the timezone specified here includes a square-bracketed instruction for how to handle overlapping periods (such as: America/Chicago[dst_enabled_on_overlap:true] or Australia/Melbourne[dst_enabled_on_overlap:false]).

-- Logstash JDBC Input Plugin -- jdbc_default_timezone -- Ambiguous Timestamps
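
Applied to the Europe/Berlin case from the original report, the option looks something like the following sketch (other connection settings omitted; dst_enabled_on_overlap:false prefers the CET interpretation of the repeated hour):

input {
  jdbc {
    # ... jdbc_connection_string, jdbc_user, statement, etc. ...
    jdbc_default_timezone => "Europe/Berlin[dst_enabled_on_overlap:false]"
  }
}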


Versions of this plugin with this feature were first shipped with Logstash v8.6.0.

Logstash 7.6+ and 8.0+ shipped with the JDBC Integration plugin, so it can be updated with the normal method:

bin/logstash-plugin update logstash-integration-jdbc

Logstash < 7.5 will need to remove the stand-alone JDBC plugins prior to installing the integration plugin, which will look something like:

bin/logstash-plugin remove logstash-input-jdbc
bin/logstash-plugin remove logstash-filter-jdbc_static
bin/logstash-plugin remove logstash-filter-jdbc_streaming

bin/logstash-plugin install logstash-integration-jdbc

s137 commented 1 year ago

@yaauie Thank you so much, that helps a lot. Unfortunately, because of another issue in Elasticsearch itself, we cannot upgrade to Logstash 8.6.0 just now, but as you pointed out, I can just upgrade the integration plugin on Logstash 8.3.3.