ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python
Other
169 stars 74 forks source link

mdfreader is very slow with large number of channels #51

Closed mattispasch closed 7 years ago

mattispasch commented 7 years ago

Problem: When using mdfreader with mdf4-files which contain a huge number of channels (> 100.000) processing takes a long time. This is about an hour for mdf4info and mdf4reader each. I my case this is particulary annoying since the files are very small (short time frame)

Cause: The main cause is that the software is checking for duplicate channel names based on lists. Using sets for those lookups reduces runtime complexity from O(n^2) to O( n*log(n) ). => Runtime is reduced to about 1/100th in my case.

Fix: I'll setup a pull request referencing this issue. I wasn't able to test this exact software version due to a bunch of bugs concerning my test mdf files. So please accept with caution. I hope it's still useful to point to the required changes

ratal commented 7 years ago

Thanks a lot for this proposal. I commited modifications from your pull requests. Please have a look if expected perfomance improvement is ok for you (without yet lazy initialisation)

mattispasch commented 7 years ago

This has a 100x (literally) bigger influence than the lazy initialisation idea. Since the project I fixed the performance for is already finished, there is no further opinion on my side ;-) I just wanted you to receive those improvement as you helped me a lot by providing your software as open source - thanks!

ratal commented 7 years ago

This was a good idea to propose to use set for improving performance, thanks. If you are satisfied with it, I would prefer to avoid the lazy initialisation. As you mentionned, could be risky for multithreading and even without this consideration, it brought couple of issues with export functions ; for instance calling units that are not existing after closing the file, etc. I could add mechanisms that wil reopen the file but seems complicated/not robust for benefits I cannot graps for the moment.