pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.62k stars 17.91k forks source link

sql values returned incorrectly. Worked in 0.10.0 doesn't in 0.11.0 #3763

Closed JasonCrowe closed 11 years ago

JasonCrowe commented 11 years ago
Code

name_q = "SELECT cust_num, name from custaddr" result = CONNECTION.execute(name_q) rows = result.fetchall()

COLUMNS = ['cust_num', 'name'] name_df = DataFrame(rows, columns=COLUMNS)

Code end

This code worked without fail on 0.10, but started failing as soon as I updated to 0.11.0. It gives me the following error.


ValueError Traceback (most recent call last)

in () 5 6 COLUMNS = ['cust_num', 'name'] ----> 7 name_df = DataFrame(rows, columns=COLUMNS) 8 9 #name_df[:5] c:\Python27\lib\site-packages\pandas\core\frame.pyc in **init**(self, data, index, columns, dtype, copy) 426 else: 427 mgr = self._init_ndarray(data, index, columns, dtype=dtype, --> 428 copy=copy) 429 else: 430 mgr = self._init_ndarray(data, index, columns, dtype=dtype, c:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_ndarray(self, values, index, columns, dtype, copy) 556 columns = _ensure_index(columns) 557 --> 558 return create_block_manager_from_blocks([ values.T ], [ columns, index ]) 559 560 def _wrap_array(self, arr, axes, copy=False): c:\Python27\lib\site-packages\pandas\core\internals.pyc in create_block_manager_from_blocks(blocks, axes) 1814 blocks = [ getattr(b,'values',b) for b in blocks ] 1815 tot_items = sum(b.shape[0] for b in blocks) -> 1816 construction_error(tot_items,blocks[0].shape[1:],axes) 1817 1818 def create_block_manager_from_arrays(arrays, names, axes): c:\Python27\lib\site-packages\pandas\core\internals.pyc in construction_error(tot_items, block_shape, axes) 1797 raise ValueError("Shape of passed values is %s, indices imply %s" % ( 1798 tuple(map(int, [tot_items] + list(block_shape))), -> 1799 tuple(map(int, [len(ax) for ax in axes])))) 1800 1801 ValueError: Shape of passed values is (1, 50178), indices imply (2, 50178) If I remove the named columns with the following code: ###### ####### Code name_q = "SELECT cust_num, name from custaddr" result = CONNECTION.execute(name_q) rows = result.fetchall() # COLUMNS = ['cust_num', 'name'] name_df = DataFrame(rows)#, columns=COLUMNS) name_df[:5] ###### ####### Code end I get: 0 0 [C001505, @@STOKES GENERAL STORE] 1 [C002815, 3R DISTRIBUTING LLC] 2 [C000032, 4 C'S EXPERT SHARPENING CO] 3 [C000033, 422 SHOES] 4 [C000034, 5 L ENTERPRISES] in 0.10 this would return two columns of data rather than one column of list.
cpcloud commented 11 years ago

to make sure that you don't have one row that is longer or shorter than the rest try the following (git master (v0.11.1 soon-to-be-released) assigns None or NaN depending on the type when this happens, I can't remember if that was different in 0.11.0)

import numpy as np
np.all(np.array(map(len, rows)) == len(rows[0]))

what is the output?

cpcloud commented 11 years ago

if you get True, then this might be a bug that has been fixed: I can't replicate with what I think your data look like. I'm assuming a list of tuples of strings or a list of lists of strings. Is that correct?

JasonCrowe commented 11 years ago

The output was true. I am returning a record set from SQL Server that is made up of two string fields.

I have downgraded to 0.10.1 and the problem goes away.

cpcloud commented 11 years ago

hmm ok i will take a look

cpcloud commented 11 years ago

@JasonCrowe Okay, here's what i'm doing to try and reproduce this, being that you probably cannot give me the data. Let me know if the sql that creates your tables is different, e.g., did I use the correct types?

import sqlite3 as sql
from pandas.util.testing import rands # random strings
nameq = 'SELECT cust_num, name from custaddr'
name_create = 'CREATE TABLE custaddr (cust_num text, name text)'
conn = sql.connect(':memory:')
cursor = conn.cursor()
cursor.execute(name_create)
conn.commit()
strlen = 10
numrecs = 1000
data = [(rands(strlen), rands(strlen)) for _ in xrange(numrecs)]
cursor.executemany('INSERT INTO custaddr VALUES (?,?)', data)
conn.commit()
result = conn.execute(nameq)
rows = result.fetchall()
cols = 'cust_num', 'name'
df = DataFrame(rows, columns=cols)

This works as expected on v0.10.1.

jreback commented 11 years ago

@cpcloud of course may be sql server vs other flavors issue

cpcloud commented 11 years ago

oh ok. i have never used sql (beyond examples) before...so i'm basically a n00b when it comes to pitfalls...should i keep going?

JasonCrowe commented 11 years ago

That looks right as far as types.

Here is a clip of the actual data: C001505, @@STOKES GENERAL STORE C002815, 3R DISTRIBUTING LLC C000032, 4 C'S EXPERT SHARPENING CO C000033, 422 SHOES C000034, 5 L ENTERPRISES

jreback commented 11 years ago

@cpcloud fyi....try pd.read_sql(....) as well, this does some type conversions

JasonCrowe commented 11 years ago

Using the following code it works fine in 0.11.0

name_q = "SELECT cust_num, name from custaddr " name_df = sql.read_frame(name_q, CONN)

jreback commented 11 years ago

@JasonCrowe is rows (when it comes out of fetchall) list-like? an actual list? what if you do

DataFrame(list(rows),columns=COLUMNS)?

cpcloud commented 11 years ago

was just about to ask that...

JasonCrowe commented 11 years ago

If I limit the query to the top 10 and print rows I get:

[(u'C000001', u'WAL-MART CORPORATE'), (u'C000002', u'CANADIAN TIRE CORPORATE'), (u'C000003', u'WAL-MART NON ASN'), (u'C000004', u'CANADIAN TIRE'), (u'C000005', u'HOME DEPOT'), (u'C000006', u'MEIJER'), (u'C000007', u'SAMS'), (u'C000008', u'TARGET'), (u'C000009', u'WAL-MART'), (u'C000010', u'WAL-MART CANADA')]

I will try DataFrame(list(rows),columns=COLUMNS)

JasonCrowe commented 11 years ago

Result:

ValueError: Shape of passed values is (1, 10), indices imply (2, 10)

cpcloud commented 11 years ago

this is with exactly the same data u show above?

jreback commented 11 years ago

This works, must be something else

In [2]: l
Out[2]: 
[(u'C000001', u'WAL-MART CORPORATE'),
 (u'C000002', u'CANADIAN TIRE CORPORATE'),
 (u'C000003', u'WAL-MART NON ASN'),
 (u'C000004', u'CANADIAN TIRE'),
 (u'C000005', u'HOME DEPOT'),
 (u'C000006', u'MEIJER'),
 (u'C000007', u'SAMS'),
 (u'C000008', u'TARGET'),
 (u'C000009', u'WAL-MART'),
 (u'C000010', u'WAL-MART CANADA')]

In [3]: DataFrame(l,columns=list('AB'))
Out[3]: 
         A                        B
0  C000001       WAL-MART CORPORATE
1  C000002  CANADIAN TIRE CORPORATE
2  C000003         WAL-MART NON ASN
3  C000004            CANADIAN TIRE
4  C000005               HOME DEPOT
5  C000006                   MEIJER
6  C000007                     SAMS
7  C000008                   TARGET
8  C000009                 WAL-MART
9  C000010          WAL-MART CANADA
cpcloud commented 11 years ago

yep that's what i get 2

JasonCrowe commented 11 years ago

Yes, they both use the same query:

SELECT top 10 cust_num, name from custaddr WHERE name is not null and cust_seq = 0 order by cust_num

cpcloud commented 11 years ago

what is cust_seq? column of ints?

JasonCrowe commented 11 years ago

Yes, but the result is the same with or without the where sql.

cpcloud commented 11 years ago

you should programmatically check the type of each tuple element of the list and then each element of that tuple (even though looks like they are all unicode objects) to make sure that there's no weirdness in there. fyi using print can hide things sometimes, e.g., print [] and print str([]) look the same in the console but they are different objects...

cpcloud commented 11 years ago

@JasonCrowe u can do what @jreback just showed correct? let's get a baseline here...

JasonCrowe commented 11 years ago

DataFrame(l,columns=list('AB')) shows the same results as he posted.

cpcloud commented 11 years ago

ok that's good. rows[:10] == l should return True...does it?

cpcloud commented 11 years ago

maybe wrap it in sorted as needed...

jreback commented 11 years ago

try this:

while(True):
   r = rows[0:100]
   try:
       DataFrame(r,columns=list('AB'))
   except:
       print r
    rows = rows[100:]

see where it breaks

JasonCrowe commented 11 years ago

Returns False

Will try while loop next.

cpcloud commented 11 years ago

at least one element of the list, then, is not a tuple...

jreback commented 11 years ago

yep....weird

cpcloud commented 11 years ago

try type(rows[list(map(lambda x: isinstance(x, tuple), rows)).index(False)]) to get the type of the first non tuple in rows.

JasonCrowe commented 11 years ago

while:

[(u'C000001', u'WAL-MART CORPORATE'), (u'C000002', u'CANADIAN TIRE CORPORATE'), (u'C000003', u'WAL-MART NON ASN'), (u'C000004', u'CANADIAN TIRE'), (u'C000005', u'HOME DEPOT'), (u'C000006', u'MEIJER'), (u'C000007', u'SAMS'), (u'C000008', u'TARGET'), (u'C000009', u'WAL-MART'), (u'C000010', u'WAL-MART CANADA')] [] [] [] [] [] []

JasonCrowe commented 11 years ago

In [14]:type(rows[list(map(lambda x: isinstance(x, tuple), rows)).index(False)])

Out[14]:pyodbc.Row

cpcloud commented 11 years ago

there you go

cpcloud commented 11 years ago

i knew someone had put a sneaky repr in somewhere...

JasonCrowe commented 11 years ago

So.... Is this a bug in pyodbc?

jreback commented 11 years ago

read_sql handles this correctly

@cpcloud maybe we should make a little function in utils somewhere to 'check' input and report where it is bad (not part of the construction code) but can be called manually

cpcloud commented 11 years ago

don't think so, but i might argue that they shouldn't repr to a Python builtin so u need to convert those objects to tuples or lists. a quick glance of the Row docs for pyodbc suggests that Row implements the collections.Sequence protocol (__getitem__, __setitem__, __len__, and __contains__, IIRC) which means you can probably do map(tuple, rows) until we do what @jreback suggests

jreback commented 11 years ago

@JasonCrowe fundamentally the issue is that pandas doesn't coerce list like objects eg you could actually store a pyodbc row object in a frame (of course the utility is limited in that case)

@cpcloud maybe we could deal with coercing collections.Sequence though

cpcloud commented 11 years ago

@jreback hm why didn't list(rows) work? that's essentially what _safe_fetch is doing...

jreback commented 11 years ago

it's not recursively deep just a top level conversipn

cpcloud commented 11 years ago

oh i c, it's from_records that makes it work

cpcloud commented 11 years ago

no that's not it...nvm

cpcloud commented 11 years ago

wait! it looks like it is: _list_to_arrays is doing it i believe...

jreback commented 11 years ago

@cpcloud a bit deeper...I know where it is....see the refernces issue

cpcloud commented 11 years ago

ok. cool. i'm having a helluva time trying to connect to a database to test this...is there no way to do this with pyodbc?

jreback commented 11 years ago

closing this....enhancement issue in #3783