pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.24k stars 17.79k forks source link

Wish: Input / output for Open Document Spreadsheet (ODS) #2311

Closed lbeltrame closed 5 years ago

lbeltrame commented 11 years ago

It would be nice to have, along with XLS and XLSX, also support for ODS (used notably by OpenOffice.org and Libreoffice, but not only that).

The main problem I saw is that there's a plethora of libraries to handle that: however not sure which is the "best" one (with a licensing that would fit with pandas).

paulproteus commented 11 years ago

It seems to me the best option for this would be:

This would be a rather substantial bit of work.

A good start for the test cases would be to look at pandas/io/tests/test_excel.py and write test cases that mirror what is in that file.

H0R5E commented 9 years ago

+1 for this feature.

Working on an open-source data wrangling project and I would like to be able to interact with spreadsheet software that is not Excel.

jtratner commented 9 years ago

@H0R5E for now, you could just convert the resulting excel file to ODS (it's pretty much a simple set of numbers, nothing more): http://stackoverflow.com/questions/15257032/python-convert-excel-file-xls-or-xlsx-to-from-ods or you could use json as an intermediary and use tablib to export to ODS: https://pypi.python.org/pypi/tablib

robertmuil commented 9 years ago

+1 for this!

davidovitch commented 9 years ago

I want to tune in with some additional notes:

Based on some initial looks, ezodf has a straight forward interface. For example, loading an ods spreadsheet into a DataFrame could go as follows:

import pandas as pd
import ezodf
doc = ezodf.opendoc('somefile.ods')

print("Spreadsheet contains %d sheets.\n" % len(doc.sheets))
for sheet in doc.sheets:
    if sheet.name == 'data':
        data = sheet
    print("Sheet name: '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )
    print("-"*40)

# a random cell
pos = (row, col)
data.get_cell(pos).display_form

df_dict = {}
for i, col in enumerate(data.columns()):
    # col is a list that contains all the cells (rows of that column)
    # assume the column name is on the first row
    colname = col[0].display_form
    df_dict[colname] = []
    print(colname)
    for j, row in enumerate(col):
        # ignore the header
        if j == 0:
            continue
        else:
            df_dict[col_name].append(row.display_form)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)

I have no idea how odfpy and ezodf compare performance wise, or how their different memory footprints are for very large spreadsheets. I haven't tried to use odfpy because that seems more complicated to get going with. I also have not checked how robust the writing process of ezodf is. I'll try to see if I can get something similar going as for pandas/io/tests/test_excel.py.

I will also mention @sorenroug from odfpy and @T0ha from ezodf just in case someone from upstream is interested in this discussion.

davidovitch commented 9 years ago

After a short exploration, it seems doable to create an ods prototype reader based on the structure of pandas/io/excel.py using the ezodf module. I've forked pandas here and will implement an ods prototype reader in the io-ods branch.

jtratner commented 9 years ago

@davidovitch - one thing to keep in mind if you choose to handle both writing and reading: you can register an ODS writer in the excel module (just implementing the required methods as listed in the code) and then you can just add a simple subclass of the excel writer tests and get a ton of test cases for free. Just a thought.

davidovitch commented 9 years ago

I would like to give an update on the current status. My fork can currently read ODF ODS files, but writing is not implemented yet. The corresponding pull request is still pending review, and it has been a lot more work than I originally anticipated.

Just a few days ago I bumped into yet another library to read/write spreadsheet: pyexcel. What is different about this library is that it aims at creating a single API for all the different read/write libraries, and it builds on all the existing libraries out there (ezodf, xlrd, etc). This is actually kind of similar to what pandas currently has, and to what I am trying to extend in PR #9070. So I am wondering, would it make sense to use a single API library (with optional dependencies to all the relevant readers/writers for the different spreadsheet types) instead of developing something similar, but tailored to pandas use case? I am speculating that a lot of code in io/excel.py could be removed when relying on pyexcel, but it adds yet another dependency. At this stage I am not sure how all different edge cases on data types and other magic happening inside the spreadsheet interpretation differs between the current pandas implementation and pyexcel.

davidovitch commented 9 years ago

The ods reader is not ready for the upcoming 0.17 release. PR #9070 is closed (see the PR for a technical discussion), and a new improved PR will be made by someone at some point in the future. I have a working version in the ezodf_reader branch of my pandas fork in case someone wants to have a look at it. Suggestions and improvements are welcome :-)

H0R5E commented 8 years ago

@davidovitch thanks for your efforts! Its a shame that that reading and writing to ODF is not better supported in python, otherwise I'm sure this would be less of an ordeal.

PBrockmann commented 8 years ago

Working on how to structure data for paleoclimatology community, I would like to promote ODS as the standard format. Then of course, pandas with this possible reading and writing at ODS format would become the natural way to start nice analysis.

+2 and more for this feature that seems so closed to be released.

naught101 commented 8 years ago

You should almost certainly be using netcdf for paleoclimate data. It's more or less industry standard. Checkout the python package 'xray' for a nice way to interface with it.

On 3 November 2015 12:30:20 am AEDT, Patrick Brockmann notifications@github.com wrote:

Working on how to structure data for paleoclimatology community, I would like to promote ODS as the standard format. Then of course, pandas with this possible reading and writing at ODS format would become the natural way for IO and start nice analysis.

+2 and more for this feature that seems so closed to be released.


Reply to this email directly or view it on GitHub: https://github.com/pydata/pandas/issues/2311#issuecomment-153016604

Sent from my Android device with K-9 Mail. Please excuse my brevity.

PBrockmann commented 8 years ago

@naught101: I know quite well netCDF format and some conventions like CF (Climate and Forecast) widely used in the Earth System Modelling community but it seems that the PaleoClimate one has also good argument to keep working with others standard. An article that is under discussion describes this motivation and promote a JSON-LD format. Read https://www.authorea.com/users/17200/articles/19163/_show_article and please feel free to contribute it. In my understanding, the data themselves are not problematic, the difficulty comes more from the metadata that are in many aspect hierachical. Netcdf attributs are in this case not very well designed. On the other hand, describe those metadata with a simple spreadsheet with a dotted notation in an open source format like ODS (I will do a proposition in this way) could be interesting because it will not radically change the behaviour of PaleoClimate scientists that widelly work with spreadsheets. All this, to say that if pandas could allow IO from ODS, it could be a good motivation to start their analysis with pandas.

shoyer commented 8 years ago

@PBrockmann If your metadata is hierarchical, maybe it would be appropriate to use the hierarchical features of netCDF4/HDF5? Just a thought. In any case, I think we agree regardless that ODS support would be valuable for pandas.

hnykda commented 8 years ago

Is there any progress on this? Thanks

jreback commented 8 years ago

@hnykda there is a PR #9070 that is not far away, but needs some work if you'd like to pick it up.

hnykda commented 8 years ago

Will take a look at it. Thank you for pointing it out.

jameserrico commented 7 years ago

On a similar note has anyone considered Google Sheets? There are some limits to the size of a sheet (http://gappstips.com/google-sheets/google-spreadsheet-limitations/) that probably make it only usable for some cases but there are quite a lot of datasets that would fit within this restrictions and the ease of collaboration with Sheets is nice.

detrout commented 7 years ago

@jameserrico There's a library https://github.com/embr/gcat that can read from google drive. I have a fork that I updated to work with python3 https://github.com/detrout/gcat but upstream hasn't merged the pull-request

It knows how to return pandas data frames. I.e. I have some code like this:


    book = gcat.get_file(book_name, fmt='pandas_excel')
    experiments = book.parse('Experiments', header=0)`
detrout commented 7 years ago

I have a standalone reader styled like pandas.io.excel with tests. https://github.com/detrout/pandasodf

It uses odfpy as that seems like its still maintained. Any comments, or should I try reformatting into a pull request creating pandas/io/odf.py?

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

naught101 commented 7 years ago

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

Makes sense. Perhaps there could be a read_spreadsheet that calls read_csv, read_excel, read_odf etc, depending on the extension?

detrout commented 7 years ago

@naught101 The general reader might have some difficulties since there are differences in what the various formats support. CSV/TSV only have a single table, while Excel & ODF can have multiple tables per file.

Also I'm currently contemplating passing the raw odfpy xml cell node for the converters callable, so those would be quite difficult to write if you didn't know the source file type.

davidovitch commented 7 years ago

I thought the .read_odf method seemed more reasonable, than overloading .read_excel

@detrout FYI, maybe things have changed since, but at the time of PR #9070 there was a very clear decision by the pandas devs not to do that and to keep read_excel for both the Excel and Open Document Format families.

TrigonaMinima commented 5 years ago

@jreback is https://github.com/pandas-dev/pandas/pull/9070 the right starting point on this or do you suggest starting with something else?

jreback commented 5 years ago

yes that issue is a good starting point

note that the internal code org has changed a lot since then

H0R5E commented 5 years ago

Great to see the ODS reader implemented. Well done! I'm not sure this should have been closed though, as writing still isn't supported.

jreback commented 5 years ago

i would open a new issue for write support (if it’s really needed)