pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: data written with to_sql legacy mode (sqlite/mysql) not persistent #6846

Closed: Acanthostega closed this issue 10 years ago

Acanthostega commented 10 years ago

UPDATE: this issue actually turned out to be a bug in the new legacy code, so all databases written with it were only modified in memory and never committed to the database file itself (see discussion below).


Hi everybody,

I still have a problem with writing data to a SQL database. With the following example, the table structure is created in the resulting database file, but the data itself is never written (I assume a pylab environment in ipython, so randint and rand are available...):


>>> import pandas as pd
>>> import sqlite3
>>> from numpy.random import randint, rand  # implicit in a pylab session

>>> data = pd.DataFrame({"galid": randint(2**63-1, size=100), "objid": randint(2**63-1, size=100), "alpha": rand(100)})

>>> conn = sqlite3.connect("/tmp/bidulechouette.db")

>>> data.to_sql("DATA", conn, if_exists="replace", flavor="sqlite", index=False)

>>> # just to be sure...
>>> conn.close()
>>> conn = sqlite3.connect("/tmp/bidulechouette.db")

>>> result = pd.read_sql("SELECT objid FROM DATA;", conn)
>>> len(result)
0

I tried it on two different systems with different versions of sqlite3 and Python (3.2 and 3.4).

If I kill ipython and redo the same thing without the if_exists option on the same database file, it complains that the table already exists, even if I manually remove the sqlite3 database file. This makes me suspect that a reference to the database is kept somewhere, which is odd because ipython was killed... Or the file it writes to isn't the right one: with a lot of data it takes a long time, as if the data were being written somewhere, which would also explain the "table already exists" error on a deleted database...

INSTALLED VERSIONS

commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.8-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.utf8

pandas: 0.13.1-605-g61ea0a3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 2.0.0
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.2
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

jorisvandenbossche commented 10 years ago

See also #6843.

I can't try to reproduce it at the moment (I get OverflowError: Python int too large to convert to C long on Python 2.7, Windows, 64-bit; I think it is something Windows-specific, since a C long is 32-bit on Windows).

But can you try it also with the new sqlalchemy-based functions? It should be something like:

from sqlalchemy import create_engine
engine = create_engine('sqlite:////tmp/bidulechouette.db')
data.to_sql("DATA", engine, if_exists="replace", index=False)

to see if you have the same issue with this.

jreback commented 10 years ago

this is for Windows:

numpy on Windows is not friendly to large ints in the random number generation. It might be a bug, as there is no way to specify the dtype upfront (e.g. randint doesn't take a dtype argument). It could be an implementation detail as well.

So I think you would have to generate int32s, cast them to int64, then multiply to get int64-like numbers (which pandas will handle).

>>> np.random.randint(2**63-1,size=100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 871, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:6682)
OverflowError: Python int too large to convert to C long

>>> np.random.randint(2**31-1,size=100)
array([1065151596,  141587709, 1034731056, 1421678991, 1113510421,
       1053591656,  721945385,  417353460,  588311441,  451124932,
       2011103288,  606400194, 1767870596, 1577034881,  801116609,
       1094153300,  681848258, 1216074099,  828001710, 1818981627,
        583590097, 1893057734, 2041021821, 1719114523,  659638699,
         36901870, 1500212655,   80923766, 1147783744, 1802598765,
        148074008,  637763026, 2105303772, 1512947348, 1699075859,
        842315446, 1365873318,  820098773,  676695949, 1474271470,
        335374748, 1477530337, 1537800347,   87529923,  516659836,
       1551506257, 1743913044, 1156954863, 1286249176, 1279615765,
       1740947158,  448293258, 1524004071, 1527588882,  795421749,
       1231177972,  264830723,  589977924,  762902232, 1805756724,
        696292676,  915029891, 1311767606,  247202537, 1929326874,
       1238932552, 2026394145,  394577189, 1241699925,  910466670,
       1911660830, 1717398704,  761388787, 1527405720, 1132149302,
        930729795, 1350701423,   50980838, 1224099292,  705686236,
       1529564451,  747506807, 1142508936,  224622576, 1217950640,
       1208551918,  365819633,  456296798,  715871164, 1228926134,
       2030700788, 1192490214, 1276642215, 1186011869, 1479536059,
        334653481, 1961647202, 1838092157,  645948253, 1868626295])
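
For reference, a minimal sketch of that workaround (not from the original thread; the variable names are just for illustration): draw values in the int32 range, cast them to int64, then scale them up.

import numpy as np

# Draw int32-range values (works on Windows), cast to int64, then scale
# them up so the result spans an int64-like range that pandas handles fine.
small = np.random.randint(2**31 - 1, size=100).astype(np.int64)
big = small * 2**32    # int64 arithmetic, stays below the int64 maximum
print(big.dtype)       # int64
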
jreback commented 10 years ago

this DOES work with @jorisvandenbossche's method (Python 3.4, numpy 1.8, on 64-bit Linux), so maybe a problem with the sqlite3 dtype conversion of int64s (on the legacy path)?

jorisvandenbossche commented 10 years ago

I can't reproduce this on Windows (Python 2.7, 64-bit), with either the legacy or the new sqlalchemy method (using the conversion of int32 to int64 and the multiplication suggested by @jreback above).

As for the problem with the database that is not deleted properly: do you also see this with other tables (e.g. with just a toy example with small values)? Or does it only occur in this example with the large values/dtype problem?

Acanthostega commented 10 years ago

I confirm that it works with the new SQLAlchemy method, but not with the legacy one.

This is not a problem specific to long integers (int64), since I can reproduce it with 2**8 instead of 2**63-1.

If I don't close the connection returned by sqlite3.connect() in the middle, the data can apparently be read back, but only from memory. Once the ipython console is closed, the database file is still empty (apart from the table structure).

cpcloud commented 10 years ago

@Acanthostega can you try just writing a small test case and running nosetests the_test.py, to see if this is an ipython issue? thx

Acanthostega commented 10 years ago

I tried with nosetests and with the "simple" python interpreter, but got the same result...

jorisvandenbossche commented 10 years ago

@Acanthostega Can you post the test you used?

Acanthostega commented 10 years ago

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import sqlite3

data = pd.DataFrame(
    {
        "galid": np.random.randint(2**8, size=100),
        "truc": np.random.rand(100),
    }
)

conn = sqlite3.connect("/tmp/machinchose.db")

data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")

conn.close()
conn = sqlite3.connect("/tmp/machinchose.db")

result = pd.read_sql("SELECT galid FROM DATA;", conn)

print(len(result))

I'm not familiar with nosetests, so I wrote a simple script and checked the resulting database with the python interpreter. I can improve the test if you want, but I would have to read the docs a bit first...

cpcloud commented 10 years ago

just write a function called test_database, then run it with nosetests the_file_containing_test_database.py

jorisvandenbossche commented 10 years ago

OK, so I think the problem is not with too-large ints, or with mixed dtypes, or something like that, but with the database connection not being closed correctly: if I don't do the conn.close() in the above script, then it works for me.

Acanthostega commented 10 years ago

@jorisvandenbossche So, do you reproduce the problem when closing the connection to the database? Because if I don't close the conn, the data can be read back from memory, but nothing is present in the database file.

@cpcloud I rewrote the test:

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import sqlite3

def test_database():
    data = pd.DataFrame(
        {
            "galid": np.random.randint(2**8, size=100),
            "truc": np.random.rand(100),
        }
    )

    conn = sqlite3.connect("/tmp/machinchose.db")

    data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")

    conn.close()
    conn = sqlite3.connect("/tmp/machinchose.db")

    result = pd.read_sql("SELECT galid FROM DATA;", conn)

    print(len(result))
    assert len(result) == len(data)

Same problem.

jorisvandenbossche commented 10 years ago

Yes, indeed I can reproduce the problem when closing and reopening the connection manually, so the data is indeed never written to the database file. I think a conn.commit() (see https://docs.python.org/2/library/sqlite3.html) is missing from the insert function in the legacy code (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L779).
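
To illustrate why, a minimal standalone sketch with plain sqlite3 (no pandas; the file path is made up): the CREATE TABLE persists because it runs outside the implicit transaction, while the INSERT stays in the open transaction and is lost when the connection is closed without a commit, which matches the "table structure but no data" symptom above.

import sqlite3

conn = sqlite3.connect("/tmp/commit_demo.db")
conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")  # table structure persists even without commit
conn.execute("INSERT INTO t VALUES (1)")                  # opens an implicit transaction
# conn.commit()   # <- without this, the inserted row never reaches the file
conn.close()

conn = sqlite3.connect("/tmp/commit_demo.db")
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 0 unless commit() is called
conn.close()
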

jorisvandenbossche commented 10 years ago

There was a conn.commit() in the previous version in 0.13.1 (https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L240), so I think this is a regression.

Acanthostega commented 10 years ago

@jorisvandenbossche I added a call to commit() in the insert method, and the problem seems to disappear. Do you want me to propose a patch, or do you think it's better if you do it yourself? I also think a test for writing to and reading from the SQL database with the legacy code should be added, with the connection to the database closed before reading, to avoid any future regression. But I don't know enough about pandas internals to propose one...
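
For illustration only, a hypothetical stand-in for such an insert method (not the actual pandas code; the names are made up), showing where the commit() call belongs:

def legacy_style_insert(con, table, rows, ncols):
    # Build a parameterised INSERT, write all rows with executemany, then
    # commit so the data reaches the file instead of staying in the open
    # transaction.
    placeholders = ", ".join(["?"] * ncols)
    cur = con.cursor()
    cur.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders), rows)
    cur.close()
    con.commit()  # the call missing from the legacy code path
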

jorisvandenbossche commented 10 years ago

@Acanthostega Good, if you want, certainly try to put up a PR! A test would certainly be good too; I'm just not sure how to do this for sqlite, as we use an in-memory sqlite database in the tests, and I think it loses all its content when the connection is closed? So we can't really test this bug that way. For mysql we should certainly test it.
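
One way around the in-memory limitation would be to test against a temporary file instead. A rough nose-style sketch (this is not the actual pandas test infrastructure, just an illustration using tempfile):

import os
import sqlite3
import tempfile

import numpy as np
import pandas as pd

def test_legacy_sqlite_write_survives_close():
    # Write with the legacy path, close the connection, reopen the file,
    # and check that the rows are really on disk.
    data = pd.DataFrame({"galid": np.random.randint(2**8, size=100)})
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")
        conn.close()

        conn = sqlite3.connect(path)
        result = pd.read_sql("SELECT galid FROM DATA;", conn)
        conn.close()

        assert len(result) == len(data)
    finally:
        os.remove(path)
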

Acanthostega commented 10 years ago

@jorisvandenbossche Ok! If I have enough time, I will try to do it this weekend!

jorisvandenbossche commented 10 years ago

@Acanthostega OK, let me know how it goes!

I updated the issue title/description to reflect the discussion.