ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

export_to_csv(): Is it possible to remove the 100hz resample? #201

Closed ecalpy closed 2 years ago

ecalpy commented 2 years ago

Python version

3.7.7

Platform information

Spyder

Numpy version

1.18.1

mdfreader version

4.0

Description

I'm trying to batch convert mdf files (mdf4 to mdf3) and also export to .csv, but we're required to not have any resampling. I realize this takes a long time, but we will have machines dedicated to doing it so the timing isn't a huge issue.

Is there an option or a way to deactivate the automatic resampling in export_to_csv()?

Thank you!

ratal commented 2 years ago

Hi, because of the nature of the .csv file format, I do not see an efficient way to export data with several sampling times at once. Some columns would be longer than others, leaving a lot of empty cells. It would probably be more efficient to split the data into several files, one per sampling (or data group)? I guess you would expect something a bit like what is already done for the xlsx export?
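A minimal sketch of the one-file-per-sampling idea, using only the Python standard library rather than mdfreader itself. The channel names, the `channels` dict layout, and the helpers `group_by_master` and `write_group` are hypothetical, chosen for illustration:

```python
import csv
import io
from collections import defaultdict


def group_by_master(channels):
    """Group channels by the name of their time (master) channel.

    `channels` maps channel name -> (master name, list of samples).
    """
    groups = defaultdict(dict)
    for name, (master, samples) in channels.items():
        groups[master][name] = samples
    return dict(groups)


def write_group(stream, columns):
    """Write one group as CSV; columns in a group share the same length."""
    names = list(columns)
    writer = csv.writer(stream)
    writer.writerow(names)
    writer.writerows(zip(*(columns[n] for n in names)))


# Two time bases with different lengths (illustrative data only).
channels = {
    "t_10ms": ("t_10ms", [0.00, 0.01, 0.02, 0.03]),
    "speed": ("t_10ms", [10.0, 10.5, 11.0, 11.5]),
    "t_20ms": ("t_20ms", [0.00, 0.02]),
    "temp": ("t_20ms", [80.0, 80.1]),
}

outputs = {}
for master, columns in group_by_master(channels).items():
    # StringIO stands in for open(f"{master}.csv", "w", newline="").
    buffer = io.StringIO()
    write_group(buffer, columns)
    outputs[master] = buffer.getvalue()
```

Each output has no empty cells because every column in a group shares the same time base; the cost is one file per time base instead of a single table.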

ecalpy commented 2 years ago

Hi- thanks for the quick response!

I haven't tried xlsx export, but our requirement is to have an uncompressed .csv file. Maybe it works to go xlsx and convert to .csv.

Looking into ASAMMDF, does it do the same? It seems to generate much larger files, but there still seems to be some sort of resampling. @danielhrisca - any input on that?

Edit: I just tried export_to_xlsx(), and yes- precisely what I would want in .csv format.

danielhrisca commented 2 years ago

> Looking into ASAMMDF, does it do the same? It seems to generate much larger files, but there still seems to be some sort of resampling. @danielhrisca - any input on that?

the file is big because all channels are interpolated using the union of all time stamps

ecalpy commented 2 years ago

> Looking into ASAMMDF, does it do the same? It seems to generate much larger files, but there still seems to be some sort of resampling. @danielhrisca - any input on that?

> the file is big because all channels are interpolated using the union of all time stamps

Does that mean it's just resampled to the longest time stamp array?

danielhrisca commented 2 years ago

No, it means that all the time channels are merged into a single one that contains all the unique time stamps. After that, all the channels are interpolated onto the new merged time channel.
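To illustrate the idea with plain NumPy (this is a sketch, not asammdf's actual code; linear interpolation and the sample values are assumptions made for the example):

```python
import numpy as np

# Two channels recorded on different time bases (illustrative values).
t_fast = np.array([0.0, 0.1, 0.2, 0.3])
fast = np.array([0.0, 1.0, 2.0, 3.0])
t_slow = np.array([0.0, 0.25])
slow = np.array([10.0, 20.0])

# Merge the time channels: sorted union of all unique time stamps.
merged_t = np.union1d(t_fast, t_slow)

# Interpolate every channel onto the merged time channel.
# np.interp is linear and holds the end values outside the original range.
fast_on_merged = np.interp(merged_t, t_fast, fast)
slow_on_merged = np.interp(merged_t, t_slow, slow)
```

The merged time channel here has five stamps, more than either original channel, which is why the exported file grows: every channel gets a value at every time stamp that appears anywhere in the file.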

ecalpy commented 2 years ago

Thanks for the explanation. I will see if this is acceptable for my project.

Thanks again for the great packages! Did you guys collaborate on your two different ones? (Always wanted to ask)

ratal commented 2 years ago

Making a csv with all your data not resampled is possible but, honestly, I doubt it would be practical for your end user. We know each other and met at a conference, but we work for different companies and are located in different countries. Our packages have different objectives and approaches, but both benefit the community, I think. Daniel has been far more active than me in recent years and has grown a good contributor community, which does not really exist for mdfreader. I have less time to spend on this package; it is rather in maintenance for the moment, and I may be more active for the next mdf standard release.

ecalpy commented 2 years ago

I agree with you- It's definitely not practical for anyone! But it was the original request. I think the project will eventually agree to a resample very soon.

We can close this issue now. Thanks for both of your support!

You piqued my interest now, is it easy to explain the different objectives between the mdfreader and asammdf?

Regardless of the differences, these two packages are very helpful for our industry.

ecalpy commented 2 years ago

Final questions regarding export_to_csv()...

Is there a way to specify which time axis is used in column A of the export? If that's not possible, would there be an easy rename option (e.g. "Time")?

Thanks!

ratal commented 2 years ago

Only my personal opinion: asammdf is very strong for data science, especially thanks to its GUI. It makes it easy to manipulate mdf files. To me, mdfreader is more for advanced python users, especially in the domain of big data. Thanks to its cython module, you can get better performance when reading files (depending on the use case; asammdf has progressed a lot on this lately). On the downside, it is sometimes not easy for all potential users to install properly. Because of its design, data are directly within reach in an interactive interpreter, whereas with asammdf you go through its API. Nowadays, the asammdf source is also becoming very complex, with files of 10k lines for instance, which could make it difficult to customise (but in the end, contribution is also possible). At work I see a much bigger user base for asammdf, but mdfreader is still used for some cases. Maybe you have a different opinion @danielhrisca ?

danielhrisca commented 2 years ago

Ever since asammdf 5.0.0 there is no option to load the channel samples into RAM, so only the file metadata is loaded. The samples are extracted on demand and in chunks, so memory usage is low most of the time. This was done with "big data" in mind, for the cases where you can't fit everything in RAM. Using the filtering option on file load and the select method delivers good performance, I would say. If you have any example where the speed is an issue, I would be very interested to investigate.

> Because of its design, data are directly at reach in interactive interpreter, while rather via API for asammdf.

The internal representation of mdfreader is simpler indeed. There is a steeper learning curve to using the asammdf API and its internal data representation.

> Nowadays, asammdf source is also becoming very complex with files of 10k lines for instance which could make it difficult to customise (but in the end, contribution is also possible)

As you know, the MDF spec is really complex, especially since version 4 (not to mention the new additions in 4.20). Having an almost 1-to-1 internal representation results in a complex code base as well.

ratal commented 2 years ago

> This was done with the "big data" in mind for the cases where you can't fit everything in the RAM. Using the filtering option on file load and the select method deliver good performance I would say.

Using the mdfreader channel_list or no_data_loading parameters, you can get the same capability. Data is also read in chunks, and there is an internal parameter for the chunk size that you can tweak to reach the best performance if needed. However, the data structure for the file metadata (blocks) is not optimal in mdfreader and could lead to more memory consumption. I must admit that mdfreader was originally designed to convert complete files into another format in batch, for processing with Matlab or Excel -> you can feel it in its design. After all, I did this project to learn the python language.

ecalpy commented 2 years ago

I also did heavy Excel and then MATLAB analysis on MDF files, but now I try to stay completely in Python and only export results or reports to Excel/html. Thanks to both of your tools, this is possible and relatively easy! I learned most of my python on mdfreader :)

I would close this issue but there was one remaining open question that I somehow formatted weird:
- Is there a way to specify which time axis is utilized in the export in column A?
- Would there be an easy rename option (e.g. "Time") if that's not possible?

Thanks!

ratal commented 2 years ago

If you resample before exporting, you can choose the master channel using the master_channel parameter of the resample() method. There is also a rename_channel() method. But be careful: the new name has to be unique, otherwise the channel will not be renamed.
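A hedged sketch of that sequence. The helper `export_with_named_time` is hypothetical, and `mdf` is assumed to behave like an mdfreader Mdf object: dict-like, with resample(), rename_channel() and export_to_csv(). Only the `master_channel` keyword comes from the explanation above; the argument order of rename_channel() (old name, new name), passing the file name positionally to export_to_csv(), and the master keeping its name after resampling are assumptions to check against your installed version:

```python
def export_with_named_time(mdf, master_channel, csv_path, new_name="Time"):
    """Resample onto one master channel, rename it, then export to CSV.

    `mdf` is duck-typed: anything dict-like that offers resample(),
    rename_channel() and export_to_csv() will do.
    """
    if new_name in mdf:
        # A non-unique name would not be applied, so fail loudly instead.
        raise ValueError(f"channel name {new_name!r} already exists")
    # Put every channel on the chosen time axis.
    mdf.resample(master_channel=master_channel)
    # Assumed argument order: (current name, new name).
    mdf.rename_channel(master_channel, new_name)
    # Assumed: the output file name is accepted as first positional argument.
    mdf.export_to_csv(csv_path)
```

Checking for the name up front turns the silent "will not rename" case into an explicit error before any resampling is done.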

ecalpy commented 2 years ago

Thanks, I will try that!