jazzband / tablib

Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
https://tablib.readthedocs.io/
MIT License
4.59k stars 590 forks source link

tablib no longer able to import Google Docs exported xlsx #465

Closed dmosberger closed 4 years ago

dmosberger commented 4 years ago

tablib used to be able to read xlsx files exported from Google Docs just fine. Now it fails with InvalidDimensions if not all columns in the spreadsheet are populated with values. For example, download this spreadsheet as xlsx (also attached below) and then try to read it like this:

python3 -c "import tablib; tablib.Databook().load(open('test.xlsx', 'rb').read(), 'xlsx')"

and it'll result in an InvalidDimensions exception. The problem seems to be due to the fact that the first row has values in 2 columns, but the second row has an empty cell in the second column. I'm guessing something changed in the way Google exports the xlsx files? test.xlsx

claudep commented 4 years ago

Isn't this the same issue as #226 ? That was supposed to be fixed in tablib 1.1.0. I cannot access your xlsx sample file, please put it in a non-protected location (with non-sensitive data, of course).

dmosberger commented 4 years ago

226 certainly looks related, but the fix seems to be specifically for csv files, not xlsx, so it doesn't look to be the same. I'm not really familiar with tablib internals though, so perhaps I'm missing something.

The test.xslx is attached to my original post, so you shouldn't need a Google account to download the test case.

claudep commented 4 years ago

Oh right, I'll have a look soon.

claudep commented 4 years ago

That might be a Google Docs pecularity. If I save the same file with LibreOffice as .xlsx, I'm getting two values ((<ReadOnlyCell 'Sheet1'.A2>, <EmptyCell>)) for the problematic row. We may apply the same fix as for the csv format to circumvent this use case.

dmosberger commented 4 years ago

Yes, I agree it's very likely something changed in Google Docs.

I tried commit 71ff737 and it fixes the test case, but I think there is still a problem if the first row has fewer columns than a subsequent row. See the attached test case.

python3 -c "import tablib; tablib.Databook().load(open('test2.xlsx', 'rb').read(), 'xlsx')"

still gives me an InvalidDimensions error.

test2.xlsx

claudep commented 4 years ago

Sure, however your latest example looks clearly wrong to me. tablib expects generally at least a full row of headers. So I don't think we can support all use cases.

dmosberger commented 4 years ago

Hmmh, I don't know. I quite often have a fixed format table and then use a cell to the right of a row to make random comments. For example:

 Col A   Col B
 val1    val2
 val3    val4    This_is_comment_about_row_3

It'd be sad if tablib can't load such a file.

EbubekirUstalar commented 2 years ago

@hugovk, @claudep could you add the solution to the import_set as well, you just added it to import_book.