keedio / flume-ng-sql-source

Flume Source to import data from SQL Databases
Apache License 2.0

control character as delimiter entry, eg: \0001 #65

Closed zhipcui closed 6 years ago

zhipcui commented 6 years ago

Most of the time, SQL data is structured and can be imported into Hive directly, but the delimiter becomes a problem when it is an ordinary visible character. A control character would work well, but we can't set one: Flume calls value.trim() on the property value, so whitespace and control characters are simply stripped away.

lazaromedina commented 6 years ago

Hi zhipcui, thank you very much for your contribution. I don't agree with using a control character as the entry delimiter for structured data from a SQL environment. I think printable chars, as entry delimiters, cover the most common use cases.

You are right: when the 'delimiter.entry' property is explicitly set to a control char, FlumeConfiguration warns and ignores the property, so the source falls back to DEFAULT_DELIMITER_ENTRY. (A default value set in code does not go through the FlumeConfiguration class, which is why it works when the default delimiter is set in code to \u0001.)

[WARN - org.apache.flume.conf.FlumeConfiguration.<init>(FlumeConfiguration.java:101)] Configuration property ignored: agent.sources.sql1.delimiter.entry = <0x01>

This warning is triggered in FlumeConfiguration.java (Flume 1.8.0):

    // Empty values are not supported
    if (value.trim().length() == 0) {
      errors
          .add(new FlumeConfigurationError(name, "",
              FlumeConfigurationErrorType.PROPERTY_VALUE_NULL,
              ErrorOrWarning.ERROR));
      return false;
    }

For a value made only of control chars, the condition evaluates to true (so the method returns false), because trim() reduces the value to an empty string.

In trunk (Flume 1.9.0-SNAPSHOT) this seems to be solved:

    // Empty values are not supported
    if (value.isEmpty()) {
      addError(name, PROPERTY_VALUE_NULL, ERROR);
      return false;
    }

In that snapshot, a value of \u0001 no longer makes the check return false, because value.isEmpty() (Flume 1.9.0) is false for it.

[Screenshot: 2018-08-14 at 11:06:43]
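To make the difference between the two checks concrete, here is a minimal standalone sketch. The class and method names are hypothetical and are not part of Flume; the two methods only mirror the conditions quoted above. It relies on the documented behavior of String.trim(), which strips every leading and trailing char with a code point of U+0020 or lower, so \u0001 is removed as if it were whitespace.

```java
class TrimCheckDemo {
    // Mirrors the Flume 1.8.0 condition: trim() strips every char <= U+0020,
    // so a value made only of control chars looks empty.
    static boolean rejectedByTrimCheck(String value) {
        return value.trim().length() == 0;
    }

    // Mirrors the 1.9.0 snapshot condition: only a zero-length string is empty.
    static boolean rejectedByIsEmptyCheck(String value) {
        return value.isEmpty();
    }

    public static void main(String[] args) {
        String delimiter = "\u0001";
        System.out.println("trim check rejects:    " + rejectedByTrimCheck(delimiter));
        System.out.println("isEmpty check rejects: " + rejectedByIsEmptyCheck(delimiter));
    }
}
```

With the trim-based check the control char is rejected as an empty value, while the isEmpty-based check accepts it.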

For these reasons, I think that keeping the default entry delimiter as a printable char, specifically the comma, is the best solution. For ingesting data from a SQL source into a sink (such as the Hive sink?) with a control char as delimiter, a parser should probably be used, at least until Apache Flume releases 1.9.0 and we upgrade flume-ng-sql-source to flume-core 1.9.0.

Best, Luis
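As a rough idea of what such a parser could look like, the sketch below re-delimits one comma-separated line with \u0001. It assumes the event body is a single CSV line with double-quoted fields and "" as the quote escape; that format, the class name, and the method names are all assumptions for illustration and are not taken from this project's code. Where it would run (between the source and the sink) is left open.

```java
import java.util.ArrayList;
import java.util.List;

class Redelimiter {
    // Splits one CSV line on commas, honoring double-quoted fields and "" escapes.
    static List<String> parseCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Joins the parsed fields with the given delimiter, dropping the CSV quoting.
    static String redelimit(String csvLine, char delimiter) {
        return String.join(String.valueOf(delimiter), parseCsvLine(csvLine));
    }

    public static void main(String[] args) {
        String out = redelimit("\"1\",\"a,b\",\"x\"", '\u0001');
        // Show the control char as '|' so the result is visible on a terminal.
        System.out.println(out.replace('\u0001', '|')); // prints 1|a,b|x
    }
}
```

Commas inside quoted fields survive, which is the main reason to parse the line instead of doing a plain string replace.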

zhipcui commented 6 years ago

Thanks for your response. I agree that visible chars cover the most common cases.