Closed: dracos closed this issue 7 years ago
Yes, I agree with what you have found, and I will look at mmap options for file_stream and file_content. Could you please evaluate `get_data(..., streaming=True)`? Streaming makes the reader use `yield`, which would at least allow you to process large csv files. For large xls files, I will have to have a look at mmap.
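To illustrate the `yield` idea mentioned above, here is a minimal sketch using only the standard library `csv` module. It is not pyexcel-io's implementation; the function name `iter_csv_rows` and the sample data are made up for illustration.

```python
import csv
import io


def iter_csv_rows(stream):
    """Yield one parsed row at a time from a text stream.

    Hypothetical helper: only the current row is held in memory,
    because csv.reader pulls lines from the stream on demand.
    """
    for row in csv.reader(stream):
        yield row


if __name__ == "__main__":
    # Made-up sample data standing in for a large CSV file.
    sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
    rows = iter_csv_rows(sample)
    header = next(rows)
    for row in rows:
        print(dict(zip(header, row)))
```

Because the function is a generator, nothing is parsed until the caller starts iterating, and the caller can stop early without touching the rest of the file.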
`streaming=True` nearly works, but my #33 shows the one case I think is left: the entire CSV file is read in by the `read()` at https://github.com/pyexcel/pyexcel-io/blob/1cffd9d2edbe8decc30968281934fcfd6a3ad774/pyexcel_io/fileformat/_csv.py#L269
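The difference between a full `read()` and line-by-line iteration can be sketched as below. This uses the standard library only and does not reproduce the pyexcel-io code at the linked line; `ReadTrackingStream`, `rows_via_read` and `rows_via_iteration` are hypothetical names used to make the behaviour observable.

```python
import csv
import io


class ReadTrackingStream(io.StringIO):
    """StringIO that records whether read() was ever called (illustrative only)."""

    def __init__(self, text):
        super().__init__(text)
        self.read_called = False

    def read(self, *args):
        self.read_called = True
        return super().read(*args)


def rows_via_read(stream):
    # Eager: the whole content is pulled into memory before parsing starts.
    content = stream.read()
    return csv.reader(content.splitlines())


def rows_via_iteration(stream):
    # Lazy: csv.reader asks the stream for one line at a time.
    return csv.reader(stream)


if __name__ == "__main__":
    lazy = ReadTrackingStream("a,b\n1,2\n")
    print(list(rows_via_iteration(lazy)), lazy.read_called)

    eager = ReadTrackingStream("a,b\n1,2\n")
    print(list(rows_via_read(eager)), eager.read_called)
```

Both functions produce the same rows; only the second path avoids calling `read()` on the stream, which is what matters for files too large to hold in memory.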
In case it's of interest, here's what I've done to switch from csv DictReader/DictWriter to pyexcel: https://github.com/mysociety/mapit.mysociety.org/commit/3c3dd947c05817bbffbdfc75ac7f3592b76f3ebc I think (apart from #33 and whatever full reads the underlying odfpy package might do) this is hopefully fully iterative and not loading anything into memory. Or hopefully near enough anyway! Thanks for providing this package :)
I see you used `iget_records`, and that is OK. `streaming=True` is passed and kept by the `iget_*` and `isave_*` functions. odfpy and ezodf both read the file fully into memory. However, there is pyexcel-odsr, a stripped-down, ods-only version of messytables. If you need both pyexcel-ods and pyexcel-odsr installed, you could specify `iget_records(..., library='pyexcel-odsr')`.
Please verify using pyexcel-io 0.3.4.
`xlrd` can operate on mmap files, so it would be useful to be able to pass one in to pyexcel. E.g. `isstream` is true for an mmap because it has a read function, and so even though I passed in `file_content`, pyexcel-io's `get_data` calls `load_data` with `file_stream` rather than `file_content`. But this then means that down in pyexcel-xls, either `getvalue` is called (current release), which errors, or `read` is called without an argument (after https://github.com/pyexcel/pyexcel-xls/issues/16's fix), which errors as mmap's read must have an argument.

My aim is to not have to read in file contents in one go at all anywhere, and for pyexcel-xls/xlrd, mmap appears to be the only way (and https://github.com/pyexcel/pyexcel-io/issues/33 would fix it for CSV, I think). What I have done is create an `mmap` subclass that does not have a read function, which then means pyexcel-io passes file_content through to pyexcel-xls and thus xlrd.
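One possible way to build such a subclass is sketched below. This is an assumption about how it could be done, not the code actually used here: `ReadlessMmap`, `looks_like_stream` and `open_readless_mmap` are hypothetical names, and `looks_like_stream` only imitates a `hasattr(obj, "read")` style check rather than reproducing pyexcel-io's `isstream`.

```python
import mmap
import os
import tempfile


class ReadlessMmap(mmap.mmap):
    """mmap subclass whose `read` attribute appears to be missing."""

    @property
    def read(self):
        # Raising AttributeError makes hasattr(obj, "read") return False.
        raise AttributeError("read is deliberately hidden")


def looks_like_stream(obj):
    # Stand-in for a duck-typing check; an assumption, not pyexcel-io code.
    return hasattr(obj, "read")


def open_readless_mmap(path):
    """Map a file read-only; the mapping stays valid after the file is closed."""
    with open(path, "rb") as handle:
        return ReadlessMmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


if __name__ == "__main__":
    # A small temporary file stands in for a large spreadsheet.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out:
        out.write(b"hello world")
    mapped = open_readless_mmap(path)
    print(looks_like_stream(mapped), bytes(mapped[:5]), len(mapped))
    mapped.close()
    os.remove(path)
```

The object still supports slicing and `len()`, so a consumer that accepts buffer-like content (for example xlrd's `open_workbook(file_contents=...)`) can page through the file without it being read in one go.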