pyexcel / pyexcel-io

One interface to read and write the data in various excel formats, import the data into and export the data from databases
http://io.pyexcel.org

mmap files are treated as streams but can't be read #34

Closed dracos closed 7 years ago

dracos commented 7 years ago

xlrd can operate on mmap files, so it would be useful to be able to pass in one to pyexcel, e.g.

sheet = pyexcel.get_sheet(file_type='xls', file_content=mmap.mmap(fp.fileno(), 0,
    access=mmap.ACCESS_READ))

isstream is true for an mmap because it has a read function, so even though I passed in file_content, pyexcel-io's get_data calls load_data with file_stream rather than file_content. Down in pyexcel-xls, this means that either getvalue is called (current release), which errors, or read is called without an argument (after the fix for https://github.com/pyexcel/pyexcel-xls/issues/16), which also errors because mmap's read must be given an argument.
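To illustrate the detection problem described above, here is a minimal sketch. The helper name looks_like_stream is hypothetical and only stands in for a duck-typing check of the kind described; it is not copied from pyexcel-io. The file contents are placeholder bytes, not a real xls file.

```python
import io
import mmap
import tempfile


def looks_like_stream(obj):
    # Hypothetical stand-in for a "has a read function" check;
    # not the actual pyexcel-io implementation.
    return hasattr(obj, "read")


with tempfile.TemporaryFile() as fp:
    fp.write(b"placeholder bytes, not a real xls file")
    fp.flush()
    mapped = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Both a BytesIO and an mmap expose read(), so both look like streams.
        stream_results = (
            looks_like_stream(io.BytesIO(b"x")),
            looks_like_stream(mapped),
        )
        # Unlike BytesIO, an mmap has no getvalue() method.
        has_getvalue = hasattr(mapped, "getvalue")
    finally:
        mapped.close()
```

A check based only on the presence of read cannot tell an mmap apart from a BytesIO, even though only the latter supports getvalue.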

My aim is to avoid reading a file's contents in one go anywhere, and for pyexcel-xls/xlrd, mmap appears to be the only way (and https://github.com/pyexcel/pyexcel-io/issues/33 would fix it for CSV, I think). What I have done is create an mmap subclass that does not have a read function, which means pyexcel-io passes file_content through to pyexcel-xls and thus to xlrd.
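One way such a workaround could look is sketched below. This is an assumption about the general shape, not the code actually used: the class name NoReadMmap and the helper open_hidden_mmap are hypothetical, and hiding read behind a property that raises AttributeError is just one possible technique.

```python
import mmap
import tempfile


class NoReadMmap(mmap.mmap):
    """mmap subclass (hypothetical) whose read attribute appears missing."""

    @property
    def read(self):
        # Raising AttributeError makes hasattr(obj, "read") return False.
        raise AttributeError("read is deliberately hidden")


def open_hidden_mmap(fp):
    # Map the whole file read-only, as in the call shown earlier in the thread.
    return NoReadMmap(fp.fileno(), 0, access=mmap.ACCESS_READ)


with tempfile.TemporaryFile() as fp:
    fp.write(b"hello mmap")
    fp.flush()
    mapped = open_hidden_mmap(fp)
    try:
        hidden = not hasattr(mapped, "read")
        # Slicing and len() still work, so byte-oriented consumers are unaffected.
        first_bytes = mapped[:5]
        size = len(mapped)
    finally:
        mapped.close()
```

Because the object no longer looks like a stream, a check based on hasattr(obj, 'read') would treat it as content, while slicing still gives access to the mapped bytes.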

chfw commented 7 years ago

Yes, I agree with what you have found, and I will look at mmap options for file_stream and file_content. Could you please evaluate get_data(... streaming=True ...)? Streaming makes the reader use 'yield', which would allow you to process large csv files at least. For large xls files, I will have to have a look at mmap.

dracos commented 7 years ago

streaming=True nearly works, but my #33 shows the one case I think is left - the entire CSV file is read in by the read() at https://github.com/pyexcel/pyexcel-io/blob/1cffd9d2edbe8decc30968281934fcfd6a3ad774/pyexcel_io/fileformat/_csv.py#L269
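For comparison, a row-by-row read that never calls read() on the whole stream could look like the sketch below. It uses only the standard csv module and is not pyexcel-io's code; the function name iter_rows and the sample data are made up for illustration.

```python
import csv
import io


def iter_rows(text_stream):
    """Yield one parsed row at a time instead of reading the whole stream."""
    # csv.reader pulls lines from the stream lazily as rows are requested.
    for row in csv.reader(text_stream):
        yield row


# Illustrative in-memory data; a real use would pass an open text file.
sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
rows = iter_rows(sample)
header = next(rows)
first = next(rows)
remaining = list(rows)
```

With a generator like this, only the rows requested so far have been parsed, so memory use does not grow with the size of the file.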

In case it's of interest, here's what I've done to switch from csv DictReader/DictWriter to pyexcel: https://github.com/mysociety/mapit.mysociety.org/commit/3c3dd947c05817bbffbdfc75ac7f3592b76f3ebc I think (apart from #33 and whatever full reads the underlying odfpy package might do) this is hopefully fully iterative and not loading anything into memory. Or hopefully near enough anyway! Thanks for providing this package :)

chfw commented 7 years ago

I see you used iget_records and that is OK. "streaming=True" is passed and kept by the iget_* and isave_* functions. odfpy and ezodf both read the file fully into memory. However, there is pyexcel-odsr, a stripped-down, ods-only version of messytables. If you need both pyexcel-ods and pyexcel-odsr installed, you could specify iget_records(...library='pyexcel-odsr').

chfw commented 7 years ago

Please verify using pyexcel-io 0.3.4.