Closed Acanthostega closed 10 years ago
See also #6843.
I can't try to reproduce it at the moment (I get OverflowError: Python int too large to convert to C long
on python 2.7, windows, 64 bit, I think something windows-specific (C long is 32 bit on windows)).
But can you try it also with the new sqlalchemy-based functions? It should be something like:
from sqlalchemy import create_engine
engine = create_engine('sqlite:////tmp/bidulechouette.db')
data.to_sql("DATA", engine, if_exists="replace", index=False)
to see if you have the same issue with this.
this is for windows:
numpy on windows is not friendly to large ints with the random number generation. might be a bug, as their is no way to specify the dtype upfront (e.g. randint doesn't take a dtype argument). could be an implementation detail as well.
so I think you would have to generate int32's cast them to int64 then multiply to get int64 like numbers (which pandas will handle).
>>> np.random.randint(2**63-1,size=100)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 871, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:6682)
OverflowError: Python int too large to convert to C long
>>> np.random.randint(2**31-1,size=100)
array([1065151596, 141587709, 1034731056, 1421678991, 1113510421,
1053591656, 721945385, 417353460, 588311441, 451124932,
2011103288, 606400194, 1767870596, 1577034881, 801116609,
1094153300, 681848258, 1216074099, 828001710, 1818981627,
583590097, 1893057734, 2041021821, 1719114523, 659638699,
36901870, 1500212655, 80923766, 1147783744, 1802598765,
148074008, 637763026, 2105303772, 1512947348, 1699075859,
842315446, 1365873318, 820098773, 676695949, 1474271470,
335374748, 1477530337, 1537800347, 87529923, 516659836,
1551506257, 1743913044, 1156954863, 1286249176, 1279615765,
1740947158, 448293258, 1524004071, 1527588882, 795421749,
1231177972, 264830723, 589977924, 762902232, 1805756724,
696292676, 915029891, 1311767606, 247202537, 1929326874,
1238932552, 2026394145, 394577189, 1241699925, 910466670,
1911660830, 1717398704, 761388787, 1527405720, 1132149302,
930729795, 1350701423, 50980838, 1224099292, 705686236,
1529564451, 747506807, 1142508936, 224622576, 1217950640,
1208551918, 365819633, 456296798, 715871164, 1228926134,
2030700788, 1192490214, 1276642215, 1186011869, 1479536059,
334653481, 1961647202, 1838092157, 645948253, 1868626295])
this DOES work with @jorisvandenbossche method (python 3.4,numpy 1.8, on 64-bit linux), maybe a problem with sqlite3 dtype conversion of int64's (on the legacy)?
I can't reproduce this on Windows (python 2.7, 64-bit), but both with the legacy and new slqalchemy method (using the conversion of int32 to int64 and multiplying as suggested by @jreback above).
The problem with the database that is not deleted properly, do you also have this with other tables (eg with just a toy example with small values)? Or does it only occur with this example with these large values/dtype problem?
I confirm that it works with the new SQLAlchemy method, but not with the legacy one.
This is not a specific problem to long integers (int64) since I can reproduce the problem with 28 instead of 263-1.
If I don't close the return of the connect function of sqlite3 in the middle, data seems to be read in the memory. But once the ipython console is closed, the database file is still empty (but with table structure).
@Acanthostega can you try by just writing a small test case and running nosetests the_test.py
to see if there's an ipython issue? thx
I tried with nosetests and "simple" python interpreter, but same result...
@Acanthostega Can you post the test you used?
#!/usr/bin/env python3
#-*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import sqlite3
data = pd.DataFrame(
{
"galid": np.random.randint(2**8, size=100),
"truc": np.random.rand(100),
}
)
conn = sqlite3.connect("/tmp/machinchose.db")
data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")
conn.close()
conn = sqlite3.connect("/tmp/machinchose.db")
result = pd.read_sql("SELECT galid FROM DATA;", conn)
print(len(result))
I'm not familiar with nosetests so I did a simple script to check the resulting database, and used the python interpreter to check. I can improve the test if you want but I have to read a little the doc before...
just write a function called test_database
then run it with nosetests the_file_containing_test_database.py
OK, so I think the problem is not with too large ints, or with mixed dtype or something like that, but with not correctly closing the database connection.
As if you don't do the conn.close()
in the above script, then it works for me.
@jorisvandenbossche So, do you reproduce the problem with closing the connection to the database? Since, if I don't close the conn
, the data is read in the memory, but nothing is present in the database file.
@cpcloud I rewrote the test:
#!/usr/bin/env python3
#-*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import sqlite3
def test_database():
data = pd.DataFrame(
{
"galid": np.random.randint(2**8, size=100),
"truc": np.random.rand(100),
}
)
conn = sqlite3.connect("/tmp/machinchose.db")
data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")
conn.close()
conn = sqlite3.connect("/tmp/machinchose.db")
result = pd.read_sql("SELECT galid FROM DATA;", conn)
print(len(result))
assert len(result) == len(data)
Same problem.
Yes, indeed I reproduce the problem when closing and reopening the connection manually, so the data are indeed never written to the database on file. I think there is a conn.commit()
(see https://docs.python.org/2/library/sqlite3.html) missing in the insert
function in the legacy code (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L779)
There was a conn.commit()
in the previous version in 0.13.1 (https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L240), so I think this is a regression.
@jorisvandenbossche I added a call to commit()
in the insert
method, and the problem seems to disappear. Do you want that I propose a patch or you think it's better if you do it yourself? I think that a test for the writing and reading in the SQL database with legacy should be added too, with the connection to the database closed before reading, to avoid any future regression. But I don't know sufficiently how pandas works in internal to propose one...
@Acanthostega Good, if you want, certainly try to put up a PR! For a test, that would certainly be good, I only don't know how to do this with sqlite, as we use a in-memory sqlite database in the tests, and I think this will lose all content when closing the connection? So we can't really test this bug that way. For mysql we should certainly test it.
@jorisvandenbossche Ok! If I have enough time, I will try to do it this week end!
@Acanthostega OK, let know how it goes!
I updated the issue title/description to reflect the dicussion.
UPDATE: this issue seemed actually to be a bug in the new legacy code so all databases written were only written in memory and not really committed to the database itself (see discussion below).
Hi everybody,
I still have a problem with writing data into a SQL database. With the following example, the resulting database file isn't written, but the structure of the table is created (I assume a pylab environment set with ipython...):
I tried it on two different systems with different versions of sqlite3 and python(3.2 and 3.4).
If I kill ipython and redo the same without the if_exists option on the same database file, it complains that the table already exists, even if I manually remove the database file of sqlite3. This lets me suppose, that somewhere, a reference to the database is kept, but it's weird because ipython is killed... Or the file in which it write isn't the good one, since with a lot of data, it takes a long time as it is writing the data somewhere, explaining the problem of existing table in a deleted database...
INSTALLED VERSIONS
commit: None python: 3.4.0.final.0 python-bits: 64 OS: Linux OS-release: 3.13.8-1-ARCH machine: x86_64 processor: byteorder: little LC_ALL: None LANG: fr_FR.utf8
pandas: 0.13.1-605-g61ea0a3 Cython: 0.20.1 numpy: 1.8.1 scipy: 0.13.3 statsmodels: None IPython: 2.0.0 sphinx: 1.2.2 patsy: None scikits.timeseries: None dateutil: 2.2 pytz: 2014.2 bottleneck: None tables: 3.1.0 numexpr: 2.3.1 matplotlib: 1.3.1 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None bq: None apiclient: None rpy2: None sqlalchemy: None pymysql: None psycopg2: None