jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.52k stars 637 forks source link

Enable skipping columns of unsupported sql types during import with db.read() #776

Open J4nJ4nsen opened 4 years ago

J4nJ4nsen commented 4 years ago

If a SqlResultSetReader encounters a column of an unmapped type in the iterated ResultSet, its values cannot be converted and thus, it throws a SqlException with an appropriate error message.

The same behaviour would be expected if the column data format is not supported (e.g. for CHAR BYTE), but the reader seems to trigger some kind of an "auto guess" conversion mechanism, which actually tries really strange stuff (for CHAR BYTE it tries to convert to boolean!?), which obviously makes no sense and thus the underlying odbc/jdbc/whatever class must throw some kind of ConversionException, which then could be catched or has to be mitigated by some prechecks. As mentioned above, those checks not work correctly for Sql Type CHAR with BYTE encoding, nor there is an exception handling for such a case.

In most cases, exclusion of those columns could be achieved by a modified select statement, but there exists cases in which u don't know what column types can be expected.

A consequent idea would be, to allow skipping of db columns with unknown data types / formats during import, which has lead me to the tech.tablesaw.columns.SkipColumnType class. As the name suggests, i expected a value type for the JDBC to Java mapping, which has a null create method or sth like this, so that this column would be skipped during the Column/Table creation process. It seems that currently nothing is skipped or excluded or sth else which would avoid a hard system exit in the above mentioned context, alltough it breaks table creation/import with a self thrown UnsupportedOperationException which indicates that a SKIP column cannot be created. In the end, i guess creating a column which should be skipped would not be what someone comes in mind if he tries to figure out how to filter out unsupported types in a query result set.

If i don't understand the real purpose of this class and/or there is another way to get to what i was searching for please let me know and close this issue.

Hope i could help, Jan

benmccann commented 4 years ago

Here's the type mapping. The first step is that Java maps the database type to a Java SQL Type. Then we pick a column type off of that. Can you tell me what Java SQL Type it's trying to convert CHAR BYTE to? It must be either BINARY, BOOLEAN, or BIT if we're choosing a BooleanColumn.

CHAR BYTE is a very uncommon datatype. Our test suite runs against H2 DB, which doesn't support this data type, so unfortunately it'll be difficult for us to test that it's handled correctly, but perhaps we can fix it if you're able to provide us enough info. Can you share the exception message that you're getting?

J4nJ4nsen commented 4 years ago

Sorry for the late answer, i had my focus on another module of our project. So here are the infos: The issue arrises when querying against a SapDB / maxDB Table which contains a foreign key relation to other db entries in another table, which are not correctly mapped by the db engine. In such a case, the query shows the corresponding internal SYSKEYs to the desired entries. This SYSKEY is simply a a column with 64bit hex values of type CHAR BYTE. The correct interpretation in java depends on the operating context (Byte[] or Char[]), but BOOLEAN makes not sense at all.

A workaround would be to display such columns as a StringColumn encoding the bytes to the ascii representation of the hex tuples lth:

More flexible would be letting the user choose which encoding should be used:

The exception message indicates that CHAR BYTE is not a correct value for a BooleanColumn. In debug mode i found out, that jdbc is giving back a BINARY Java SQL Type (which seems to be correct).

benmccann commented 4 years ago

Ah, yes, I see in the MySQL docs:

The CHAR BYTE data type is an alias for the BINARY data type.

I agree it doesn't make sense to use a BooleanColumn for that. We don't have any column type that could store a byte[]. Just changing it to a StringColumn is probably the easiest fix, so I went ahead and did that: https://github.com/jtablesaw/tablesaw/commit/d1348cde230b05e26fc3002f251ba828303e6c4e

Test it out if you get a chance and let me know if that fixes it or not. I'll close this for now assuming that it does fix the issue

Kasperx commented 2 years ago

Hi Thanks for commit... I'm another member of @J4nJ4nsen team and tested it with System.out.println(Table.read().db(rs).print()); and it does not work without customization :( I imported project to ide and changed code to this...

image

Just a dirty workaround to take unspecified object. The other else-if functions are not called :( So maybe the specific datatype is (still) missing.

Second problem: The resultset knows a remote connected database. But the while loop in class SqlResultSetReader doesn't find table content so nothing but table column names is displayed :( In debugger I can see correct (remote) url. Seems like while (resultSet.next()) {ignores it :(

Kasperx commented 2 years ago

Hi @benmccann Pls reopen this topic.

Kasperx commented 2 years ago

Thanks The commit doesn't fix the issue. Do you have more ideas? (Pls don't forget problem with remote connection. Is it valid to be subtopic?)

lwhite1 commented 2 years ago

@Kasperx

(Pls don't forget problem with remote connection. Is it valid to be subtopic?)

If it's a separate problem please create a new issue.

A consequent idea would be, to allow skipping of db columns with unknown data types / formats during import, which has lead me to the tech.tablesaw.columns.SkipColumnType class.

Yes, the SQL import logic should probably support SKIP the way CSV import does. More generally, it should probably support user defined control over the types. I say "probably" only because I haven't investigated enough to know if it's difficult or impossible.

For your use case, however, I don't see how SKIP could help. You seem to be say that you don't know when you will get a column of binary data so there would be no way to tell the system to skip a particular column. There's no facility in Tablesaw that tells the system to skip all the columns of a particular type since that's a rare use case.

it breaks table creation/import with a self thrown UnsupportedOperationException which indicates that a SKIP column cannot be created.

I can't really answer this since you're not proving any code, but SKIP is not a real column type so, no, you can't create a SKIP column. All SKIP does is tell the file importer to pass over a particular column without importing it. And as I mentioned, it is not supported in the DB import code currently.

The issue arrises when querying against a SapDB / maxDB Table which contains a foreign key relation to other db entries in another table, which are not correctly mapped by the db engine

It sounds like the issue is a bug in the database you're using. Tablesaw doesn't support binary data so I'm not sure how much effort should be put into making it possible to import binary data when for most users, the solution to this issue would be to change the select statement to exclude the binary columns.

A workaround would be to display such columns as a StringColumn encoding the bytes to the ascii representation of the hex tuples lth: 0x0F -> "0x0F" (char[] -> String)

Sure, but most forms of binary data (images, movies, Excel file contents, etc.) would not produce a string representation that was at all useful.

I'm another member of @J4nJ4nsen team and tested it with System.out.println(Table.read().db(rs).print()); and it does not work without customization :(

What happens?

Kasperx commented 2 years ago

Thanks for answer, sry for late response.

This code (dbc is an instance from an own class)

ResultSet rs = dbc.getResultSet("select name from line");
System.out.println(tech.tablesaw.api.Table.read().db(rs).printAll());
System.out.println(tech.tablesaw.api.Table.read().db(rs).print());

leads to this content :(

 NAME  |
--------
 NAME  |
--------

I would expect something like "21,78,30".

image