Closed SimonOsipov closed 5 years ago
Unfortunately there is not a lot to help with here, as nothing is reproducible. I get the impression, given the amount of memory on your computer, that you weren't expecting to get that error with a CSV file that is only 1.5 GB, though keep in mind that the memory used to build Python objects in memory is not the same as hard disk storage.
Do you have a way of isolating this to part of the data?
Yes, you got the impression right: I have plenty of RAM, but still get a MemoryError.
And yes, we are talking about RAM, not hard disk storage; I know the difference.
What do you mean by "isolating this to part of the data"? Do you mean creating a file with the same columns but dummy data and sharing it for testing?
Anyway, thanks for answering and for the support!
Yeah, the problem with the code provided is that it isn't "self-contained", meaning no one can copy/paste it to reproduce the issue and get the same error. That makes it extremely difficult, if not impossible, to help troubleshoot.
I will close the issue, problem solved (by setting chunk size = 400_000 rows).
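For reference, a minimal sketch of the kind of chunked read that resolved it; the file name and the final concatenation are assumptions, not the original code:

import pandas as pd

# Read the CSV in chunks of 400_000 rows so only one chunk has to be
# materialized at a time, then stitch the chunks back together.
reader = pd.read_csv("export.csv", chunksize=400_000)
df = pd.concat(reader, ignore_index=True)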
Code Sample
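Roughly, the two approaches described below look like this; the connection string, table name, and file name are assumptions, not the original code:

import pandas as pd
from sqlalchemy import create_engine

# Approach 1: pull the table straight from MySQL
# (this is the run that never finished).
db_engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df = pd.read_sql("SELECT * FROM mytable", db_engine)

# Approach 2: export the table to CSV first and parse it with pandas
# (this is the run that raised MemoryError).
df = pd.read_csv("export.csv", engine="c", low_memory=False)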
Specs:
RAM: 32 GB
CPU: Intel Core i7 4 GHz
Python: 3.7.1
pandas: 0.23.4
OS: Windows 10 (x64)
Setup
Data is in a MySQL DB (5.6.38), 9 columns (7 int, 2 DateTime). The DB is on the local machine, so there are no network bandwidth issues. 22 million rows of data; the same data as CSV is 1.5 GB.
Expected Output
I expect to get a dataframe, but instead I face a MemoryError or a script that takes effectively infinite time to complete. When I monitor memory usage via Task Manager, Python barely reaches 1.5 GB, and I still have more than 15 GB available by the time I get the error.
Problem description
When I try to connect directly to the DB, it takes infinite time to get the data out; I ran the script and it did not finish in 30 minutes. I went to SO and got the impression that this is not a good way to handle this amount of data via a connector, so I exported the DB to CSV.
When I tried to work with the CSV, I got MemoryError. I tried both engines, c and python, different chunk sizes, low_memory as both True and False, with and without specifying dtypes, but I still get MemoryError. I spotted only two differences:
If I parse without chunks, I get MemoryError during parsing. When I parse in chunks, I get it on concatenation into one DataFrame.
Also, chunking by 5_000_000 rows gives the error on parsing; smaller chunks give it on concatenation.
Here is an error message:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
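For concreteness, a chunked read along the lines described above, with dtypes specified up front, might look like this; the column names and dtype choices are placeholders, not the actual schema:

import pandas as pd

# Narrow integer dtypes for the 7 int columns (names assumed),
# and parse the 2 DateTime columns (names assumed) as dates.
dtypes = {f"col{i}": "int32" for i in range(1, 8)}

reader = pd.read_csv(
    "export.csv",
    dtype=dtypes,
    parse_dates=["created_at", "updated_at"],
    chunksize=5_000_000,
)

# With chunksize=5_000_000 the MemoryError showed up while parsing a chunk;
# with smaller chunks it showed up here, on concatenation.
df = pd.concat(reader, ignore_index=True)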
What am I facing here? Is this some kind of limitation in Pandas and I need to switch to Spark or Dask, or am I doing something wrong?