DataFrame.to_sql generated text field that could not be used by "group by". (MS sql)

pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

https://pandas.pydata.org

BSD 3-Clause "New" or "Revised" License

43.26k stars 17.8k forks source link

DataFrame.to_sql generated text field that could not be used by "group by". (MS sql) #7957

Open NathanielCapital opened 10 years ago

NathanielCapital commented 10 years ago

In _sqlalchemy_type function, pandas import Text type from sqlalchemy as the default type when generating text field in SQL table, which is transformed into "text" type in microsoft sql. This type could not be used by "group by" in SQL query. SQL query with a group by clause with the field generate an error msg like:

Msg 306, Level 16, State 2, Line 3 The text, ntext, and image data types cannot be compared or sorted, except when using IS NULL or LIKE operator.

I tried default text field to be the String, which works for MS sql.

jorisvandenbossche commented 10 years ago

@NathanielCapital Did it work without specifying a max length? (so with String instead of String(some_number)). I suppose a varchar type will be created for MS SQL, but eg http://docs.sqlalchemy.org/en/rel_0_9/dialects/mssql.html#sqlalchemy.dialects.mssql.VARCHAR seems to imply a length should be given.

For strings it seems indeed better to use String (which will be converted to character varying in general I think). But would there be other consequences of changing Text to String? @mangecoeur @danielballan @hayd any idea?

NathanielCapital commented 10 years ago

@jorisvandenbossche Yes, it works without specifying a max length. I just made it return String type other than Text. The table text field generated is nvarchar(max) on my Microsoft SQL 2008 R2.

aergener commented 10 years ago

@jorisvandenbossche Text is deprecated in the SQL Server, and has limited functionality. (see links below) Nvarchar and varchar types are preferable. Since this may vary between SQL flavors, perhaps there could be a setter function to set the string representation in pandas.io.sql.PandasSQLTable._sqlalchemy_type.

http://msdn.microsoft.com/en-us/library/ms187993.aspx http://stackoverflow.com/questions/564755/sql-server-text-type-vs-varchar-data-type

mangecoeur commented 10 years ago

@aergener it would be better to avoid requiring setters like this - I'm confident we can get this to work without encouraging hacks, though anyone who really wants to change behaviour is free to subclass PandasSQLTable and override whatever they like.

Seems there are 2 options: http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.Unicode and http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.String

As well as explicitly setting nvarchar etc. (NVARCHAR is automatically subclass of Unicode it seems). Personally I like the idea of using Unicode since it fits well with Python3 going forwards where people expect all text to be unicode. Not sure what that means for python2 however, might then require explicit conversion so it might not be worth the trouble :/

jorisvandenbossche commented 10 years ago

I read in the SQLAlchemy docs (http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.Unicode):

When using the Unicode type, it is only appropriate to pass Python unicode objects, and not plain str. If a plain str is passed under Python 2, a warning is emitted.

In that case, it seems to me that we should use String? Or at least check for the type and maybe return something different (String/Unicode) dependent on type or python 2/3? I suppose always using the String type will work as long as your (unicode) data are also valid ascii? But when having non-ascii characters this will lead to an error?

jorisvandenbossche commented 10 years ago

With lib.infer_dtype we could check for string/unicode in principle. And then for python 2 return the appropriate String or Unicode, and for python 3 always Unicode ?

jreback commented 10 years ago

careful as this must scan all data

jorisvandenbossche commented 10 years ago

@jreback yes, but we already do this .. (if it is an object type column, to check for date/time: https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L788)

jreback commented 10 years ago

@jorisvandenbossche right ok

jorisvandenbossche commented 9 years ago

Problem with changing Text to String is that for some databases, we have to specify a maximum length. For MSSQL or PostgreSQL this does not seem to be a problem (they accept varchar without specifying a maximum lenght, but for MySQL this is required. When you try to use just String, you get the following error:

CompileError: (in table 'test_string', column 'string'): VARCHAR requires a length on dialect mysql

which is what is actually also explained in the sqlalchemy docs of String: http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.String

jorisvandenbossche commented 9 years ago

I know this does not solve the issue, but after PR #8926, you can specify a sqlalchemy type per column, so to override the default chosen one. So at least that gives a way to deal with this issue for now.