pyexcel / pyexcel-io

One interface to read and write the data in various excel formats, import the data into and export the data from databases
http://io.pyexcel.org
Other
58 stars 20 forks source link

optimize calls to number_of_columns() in SheetReader._iterate_columns() #25

Closed ashkulz closed 7 years ago

ashkulz commented 7 years ago

This can be costly for certain implementations e.g. pyexcel-xlsx (which internally uses openpyxl) where it is computed on-the-fly.

It isn't possible to make the fix in pyexcel-xlsx as there is the possibility of it changing due to modifications via the API. One does have to make the (rather reasonable) assumption that the number of columns doesn't change during execution of to_array.

codecov-io commented 7 years ago

Current coverage is 99.00% (diff: 100%)

Merging #25 into master will increase coverage by <.01%

@@             master        #25   diff @@
==========================================
  Files            31         31          
  Lines          2517       2518     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2492       2493     +1   
  Misses           25         25          
  Partials          0          0          

Powered by Codecov. Last update 7bcfb4e...f95361b

chfw commented 7 years ago

Your pull request is appreciated. However, I failed to see the logical difference although the file were changed.

At the moment, pyexcel-xlsx use max_column to get the number of columns. Are you suggesting a different function to be used?

In order to support columns computed on the fly, here is an example in ods where _iterate_columns was overriden.

I think the interfaces in Sheet.py could be revised further to make it clearer. Do you have any suggestions?

ashkulz commented 7 years ago

When pyexcel-xlsx calls max_column, it triggers a dynamic computation which can take a lot of time ... in case you have 10000 rows, it gets called 10000 times instead of 1, which is quite wasteful. All the change does is to pre-compute it so that it doesn't have to be done every time.

In case _iterate_columns is re-implemented in a derived class, then the ncols parameter can be ignored.

ashkulz commented 7 years ago

As a comparison, in 0.2.x the commit advarisk/pyexcel-io@eb3bfbc87aee20ac72b39f3971650c4d34608914 feels very clear and logical (which is what I'm actually using in production). I don't think that the API with _iterate_columns is perfect; but don't really have an alternate suggestion :frowning_face:

chfw commented 7 years ago

I see what you meant now. max_column is computed always hence your change will cache the value which could improve performance.

Let me do a bit research around it.

ashkulz commented 7 years ago

In case you're open to releasing 0.2.4.1 with the above change, I'll submit a PR for that too ... but there is no 0.2.x branch, which is what I've done in the fork.

chfw commented 7 years ago

have create branch 0.2.x and will release it while I review all plugins and make necessary changes in 0.3.0 and its plugins.

chfw commented 7 years ago

please evaluate latest pyexcel-xlsx for performance improvement.