pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Memory Error during read_sql_table and read_csv #24682

Closed SimonOsipov closed 5 years ago

SimonOsipov commented 5 years ago

Code Sample

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://root:@localhost:3306/gds')
search_df = pd.read_sql_table('search', engine)

or

# Explicit dtypes so the parser does not have to infer column types.
dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int,
}
search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)

Specs:

RAM: 32 GB
CPU: Intel Core i7, 4 GHz
Python: 3.7.1
pandas: 0.23.4
OS: Windows 10 (x64)

Setup

The data is in a MySQL database (5.6.38): 9 columns (7 int, 2 DateTime) and 22 million rows. The database is on the local machine, so there are no network bandwidth issues. The same data exported to CSV is about 1.5 GB.

Expected Output

I expect to get a DataFrame, but instead I face a MemoryError, or the script takes a seemingly infinite time to complete. When I monitor memory usage via Task Manager, Python barely goes above 1.5 GB; I still have more than 15 GB free by the time I get the error.

Problem description

When I try to connect directly to the database, it takes a seemingly infinite time to get the data out; I ran the script and it did not finish within 30 minutes. I went to Stack Overflow and got the impression that pulling this amount of data through a connector is not a good way to handle it, so I exported the database to CSV.
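For reference, read_sql_table also accepts a chunksize argument, which makes it yield the table piece by piece instead of building one huge DataFrame in a single call. A minimal sketch of that chunked variant (the chunk size of 100,000 rows is just an assumption):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://root:@localhost:3306/gds')

# chunksize makes read_sql_table return an iterator of DataFrames
parts = []
for part in pd.read_sql_table('search', engine, chunksize=100_000):
    parts.append(part)
search_df = pd.concat(parts, ignore_index=True)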

When I tried to work with the CSV, I got a MemoryError. I tried both parser engines (c and python), different chunk sizes, low_memory set to True and False, and reading with and without explicit dtypes, but I still get a MemoryError.

I spotted only two differences:

If I parse without chunks, I get the MemoryError during parsing.

If I parse in chunks, I get it during concatenation into one DataFrame.

Also, chunking by 5,000,000 rows raises the error during parsing; smaller chunks raise it during concatenation.
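A minimal sketch of the two code paths described above, reusing the dtype mapping from the code sample (the 1,000,000-row chunk size is just an example):

# Path 1: reading the whole file at once fails during parsing.
search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)

# Path 2: reading in chunks fails later, when concatenating into one DataFrame.
chunks = pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000)
search_df = pd.concat(chunks, ignore_index=True)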

Here is the error message:

pandas.errors.ParserError: Error tokenizing data. C error: out of memory

What am I facing here? Is this some kind of limitation in pandas, so that I need to switch to Spark or Dask, or am I doing something wrong somewhere?

WillAyd commented 5 years ago

Unfortunately there is not a lot to help with here, as nothing in the report is reproducible. I get the impression, given the amount of memory on your computer, that you aren't expecting to get that error with a CSV file that is only 1.5 GB; keep in mind, though, that the memory needed to build Python objects from a file is not the same as its on-disk size.

Do you have a way of isolating this to part of the data?
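For example, reading just a slice of the file and checking its in-memory footprint would help narrow things down. A rough sketch (the row count is arbitrary, and dtype is the mapping from the original report):

import pandas as pd

# Read only the first million rows to see whether a subset already fails.
sample = pd.read_csv('search.csv', sep=',', dtype=dtype, nrows=1_000_000)

# The in-memory size is usually much larger than the on-disk CSV size.
print(sample.memory_usage(deep=True).sum() / 1024 ** 3, 'GiB in memory')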

SimonOsipov commented 5 years ago

Yes, you got that impression right: I have plenty of RAM but still get a MemoryError. And yes, we are talking about RAM, not hard disk storage; I know the difference.

What do you mean by "isolating this to part of the data"? Do you mean creating a file with the same columns but dummy data and sharing it for testing?
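For example, something along these lines to generate a file with the same schema (a rough sketch; the column names come from the dtype mapping above, the value ranges are made up):

import numpy as np
import pandas as pd

n = 22_000_000  # same row count as the real table
dummy = pd.DataFrame({
    'search_id': np.arange(n),
    'job_count_total': np.random.randint(0, 100, n),
    'job_count_done': np.random.randint(0, 100, n),
    'city_id_start': np.random.randint(1, 1000, n),
    'city_id_end': np.random.randint(1, 1000, n),
    'date_start': '2019-01-01',
    'datetime_create': '2019-01-01 00:00:00',
    'agent_id': np.random.randint(1, 500, n),
    'ride_segment_found_cnt': np.random.randint(0, 10, n),
})
dummy.to_csv('search_dummy.csv', index=False)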

Anyway, thanks for the answer and the support!

WillAyd commented 5 years ago

Yeah, the problem with the code provided is that it isn't self-contained, meaning no one can copy and paste it to reproduce the issue and get the same error. That makes it extremely difficult, if not impossible, to troubleshoot.

SimonOsipov commented 5 years ago

I will close the issue; the problem is solved by setting the chunk size to 400,000 rows.
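A minimal sketch of how the chunked read looks (what is done with each chunk afterwards is left as a placeholder, and dtype is the mapping from the original report):

import pandas as pd

total_rows = 0
# Read the CSV in 400,000-row chunks; each chunk is an ordinary DataFrame.
for chunk in pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=400_000):
    total_rows += len(chunk)  # placeholder per-chunk work; keeps memory bounded
print(total_rows)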