rpbouman / pentaho-pdi-plugin-jdbc-metadata

A transformation step for Pentaho Data Integration that gives access to the various metadata result sets of the JDBC DatabaseMetaData object.
Apache License 2.0

Add "Kettle Datatype" to the output columns #2

Closed: codek closed this issue 9 years ago

codek commented 9 years ago

I'd like to know the Kettle Datatype in addition to the database datatype. Could that be added?

It would save us re-doing the mappings that already exist, e.g. VARCHAR -> String, etc.

Is it even possible? Does the reason make sense? (Essentially I'm using the data for metadata injection, so obviously I need PDI data types for that to make sense.)

rpbouman commented 9 years ago

Hi Dan! Yes I think it makes sense. So in the output grid, add another column to choose the kettle datatype and use a sensible default?

cdeptula commented 9 years ago

@rpbouman I think what @codek was suggesting was adding a column to the output of the step, when the step is returning a list of columns, that maps the SQL data type to the Kettle data type. So you would have a new output field KETTLE_TYPE_NAME when using the columns method. Basically: if you selected that column in a query, what type would Kettle assign to it?

I looked at this last week and Pentaho does not make this easy. The built-in Pentaho mechanism to determine the type of a column from SQL requires you to query the actual column. So in order to use the built-in method you would have to run a select column from table query for every column. Obviously this is not ideal for this step, and it would be better to have a function where you passed in the SQL type integer and got the Pentaho type back. See the org.pentaho.di.core.row.value.ValueMetaBase and org.pentaho.di.core.database.Database classes, specifically the getValueFromSQLType function.

Option 2 is to re-implement the getValueFromSQLType mapping manually in this step; however, this will not work with data type plugins and would have to be maintained any time Pentaho changes the type mapping. I do not know if this approach is worth the risk of it being wrong...
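
To make the trade-off of option 2 concrete, here is a minimal sketch of what a hand-rolled mapping from the java.sql.Types integers to Kettle type names could look like. The specific type assignments below are illustrative assumptions, not Pentaho's actual getValueFromSQLType logic:

```java
import java.sql.Types;

// Illustrative only: a hand-rolled mapping from java.sql.Types integers to
// Kettle type names. This is NOT Pentaho's actual getValueFromSQLType logic;
// the assignments are example choices and would have to be kept in sync by hand.
public class SqlTypeToKettleType {

    public static String kettleTypeName(int sqlType) {
        switch (sqlType) {
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
            case Types.CLOB:
                return "String";
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                return "Integer";
            case Types.FLOAT:
            case Types.REAL:
            case Types.DOUBLE:
                return "Number";
            case Types.NUMERIC:
            case Types.DECIMAL:
                return "BigNumber";
            case Types.BIT:
            case Types.BOOLEAN:
                return "Boolean";
            case Types.DATE:
                return "Date";
            case Types.TIMESTAMP:
                return "Timestamp";
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
            case Types.BLOB:
                return "Binary";
            default:
                return "String"; // arbitrary fallback; real drivers report many more types
        }
    }
}
```

Exactly as noted above, a hard-coded table like this knows nothing about data type plugins and silently drifts whenever Pentaho changes its own mapping.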

What are your thoughts?

rpbouman commented 9 years ago

@cdeptula I sent Dan a request for more info.

Meanwhile: "I looked at this last week and Pentaho does not make this easy. The built in Pentaho mechanism to determine the type of the column from SQL requires you to query the actual column."

Really? Can't you just use a "Metadata from Stream" step and get the kettle data type for any stream, including a stream on a 0-row table input step?

codek commented 9 years ago

Hi Guys,

Yes, @cdeptula seems to have the right idea: an additional column called KETTLE_DATATYPE.

So, an Oracle field would be defined like so:

VARCHAR

(case sensitive)

But the KETTLE_DATATYPE would have the value:

String

Oracle also reports types like CHAR and VARCHAR2, all of which map to the PDI String type.

Same with numbers: various databases report Decimal, Double, Numeric, etc., but in PDI terms it's either Number or Integer.

I'd have thought @rpbouman was right though - the Metadata structure of stream step somehow manages this. However, it only reports once data actually flows through, so I don't know precisely whether it can get the metadata just from the query metadata or whether it waits until the query is executed.

(Metadata structure of stream is an invaluable step these days!)

rpbouman commented 9 years ago

@cdeptula @codek I'd have to think a little bit about this. Currently the step is a more or less 1:1 data source for JDBC metadata, which obviously does not include Kettle data types.

Is it, as @cdeptula already said, a feature that would come in handy mainly for the Columns method?

The columns resultset includes a "DATA_TYPE" column, which contains an integer corresponding to java.sql.Types. See this for the full list of values: http://docs.oracle.com/javase/7/docs/api/constant-values.html#java.sql.Types.BIT

(It also includes a TYPE_NAME column, which is indeed the database-specific name for that type.)

For now, would it be acceptable to add a Value Mapper step to convert DATA_TYPE from the JDBC Metadata step into a string value representing the Kettle data type?
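
As a concrete illustration of that suggestion (a sketch, not part of the plugin): DATA_TYPE carries the raw java.sql.Types integers, so those integers would be the source values in the Value Mapper, with Kettle type names as targets. The snippet below just prints a few of the common codes for reference; the suggested targets in the comments are assumptions, not a complete or authoritative mapping.

```java
import java.sql.Types;

// The DATA_TYPE field holds raw java.sql.Types integers, so these are the
// source values you would enter in a Value Mapper step. Targets in the comments
// are example Kettle type names, not an authoritative or complete mapping.
public class PrintJdbcTypeCodes {
    public static void main(String[] args) {
        System.out.println("VARCHAR   = " + Types.VARCHAR);   // 12 -> e.g. String
        System.out.println("INTEGER   = " + Types.INTEGER);   //  4 -> e.g. Integer
        System.out.println("DECIMAL   = " + Types.DECIMAL);   //  3 -> e.g. BigNumber
        System.out.println("DOUBLE    = " + Types.DOUBLE);    //  8 -> e.g. Number
        System.out.println("DATE      = " + Types.DATE);      // 91 -> e.g. Date
        System.out.println("TIMESTAMP = " + Types.TIMESTAMP); // 93 -> e.g. Timestamp
    }
}
```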

codek commented 9 years ago

Ah yes, I did not know about that. For sure, a Value Mapper would be adequate in that case, and that is what I would do. Thanks!

rpbouman commented 9 years ago

Guys, could you please check https://github.com/rpbouman/pentaho-pdi-plugin-jdbc-metadata/blob/master/example/column-data-type-to-kettle-type.ktr and see if this is an acceptable way to deal with it? If so, I'd like to close this issue.

cdeptula commented 9 years ago

Yes, I think the Value Mapper is a good enough solution. It is worth noting that, depending on the database, Pentaho may treat the same database type as different Pentaho types. For example, if MySQL returns a binary data type, Pentaho treats it as a String, but it treats binary from other databases as a Binary.

The best solution in that circumstance would be to use a Table Input step where you select * from table where 1=0, followed by a Metadata structure of stream step to get the Pentaho types, rather than using the JDBC Metadata plugin followed by a Value Mapper.
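
For what it's worth, here is a minimal plain-JDBC sketch of why the where 1=0 trick works: even a zero-row result set exposes full column metadata via ResultSetMetaData, which is presumably what the Metadata structure of stream step relies on. The JDBC URL, credentials and table name below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

// Demonstrates that a zero-row result set still exposes full column metadata.
// The JDBC URL, credentials and table name below are placeholders.
public class ZeroRowMetadataDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM some_table WHERE 1 = 0")) {
            ResultSetMetaData md = rs.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                System.out.println(md.getColumnName(i)
                        + " sqlType=" + md.getColumnType(i)        // java.sql.Types integer
                        + " typeName=" + md.getColumnTypeName(i)); // database-specific name
            }
        }
    }
}
```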

It would be ideal if this plugin were able to return the Kettle types, since in my experience a frequent use of this plugin is for Metadata Injection, where you often need to know the Pentaho type. However, due to Pentaho's implementation of the type mapping, it does not fit well within this plugin.

rpbouman commented 9 years ago

@cdeptula Ok, thanks for that input. Well, if Kettle has some way of mapping data types that cannot be handled with a simple Value Mapper step, then maybe it is worth figuring out what can be done to make this process easier.

That said, I'm reluctant to hard-wire type mapping for the target "kettle" - some people might want to use this step to generate or drive ETL against Hadoop and thus want Hive or Impala flavoured data type mappings. I'm sure that in a year's time, more targets could be added to the list. The plugin was designed with the aim of reporting metadata about a source RDBMS connection - not translating it towards some other kind of data source. Not that that isn't a great and interesting goal, but there are simply too many targets.

So I understand the need, but I'm just not sure whether the type mapping belongs in this step, just because the target happens to be Kettle.

Does that make sense? If not, don't hesitate to try and convince me. I'm really on the fence on this one.

codek commented 9 years ago

I can confirm the transformation works fine

https://raw.githubusercontent.com/rpbouman/pentaho-pdi-plugin-jdbc-metadata/master/example/column-data-type-to-kettle-type.ktr

In fact, this solution is better than putting it into the step, because in cases where the driver gets it wrong you have full control - and I've definitely seen issues, especially with binary data types on DB2, where drivers report them as String, etc.!

It also allows you to decide whether to use Timestamp or Date etc.

So yes, happy for this to be closed now. Sorry for the slow reply, I, err, forgot!

rpbouman commented 9 years ago

Hi Dan! No worries :) Happy to hear this solves it. Best regards, Roland.