the-siesta-group / edfio

Read and write EDF/EDF+ files.
Apache License 2.0
25 stars 5 forks source link

Please be strict #31

Open Teuniz opened 7 months ago

Teuniz commented 7 months ago

when opening & reading EDF files and undo the following patches:

Ofcourse, you do this only if you care more about the future of the EDF format than disgruntled users.

If you want to help your users, offer them a (Python) script/program that fixes their files. This will create awareness in case of a non-compliant file AND it will motivate to resolve/report the problem at the source.

Thanks!

cbrnr commented 7 months ago

AND it will motivate to resolve/report the problem at the source

Theoretically, this is true, but practically, this is not going to happen. The source will be some company, which means they will only respond if their profits are affected by their non-standard EDFs. Which very likely is not the case.

So I'm OK with being a bit less strict, because telling users to fix their files elsewhere is not a great user experience (it's not their fault in most cases). But I'd be OK with issuing a warning or maybe an info message in the logger if non-standard fields are encountered.

Teuniz commented 7 months ago

Theoretically, this is true, but practically, this is not going to happen.

It already happened. Thanks to EDFbrowser being strict, today there are less non-compliant files in the wild. Most non-compliant files are from the past.

Anyway, there's no disadvantage of being strict when offered a solution to fix the file (like EDFbrowser does).

hofaflo commented 7 months ago

Thank you both for your input!

Allowing to read non-compliant EDF files was one of the main motivations in creating this package. We still see this as an important part of the user experience, especially when working with large numbers of files from different sources.

However, you make a good point regarding awareness of non-compliant files and providing a way to fix them, ideally an easy one. In an internal discussion, we came up with the following strategy:

This way, any file created with edfio is compliant, users are made aware of non-compliance, and they have an easy way to fix it.

Autofixes should include the following things, suggestions are very welcome:

cbrnr commented 7 months ago

I like this plan, I only have some minor comments:

Teuniz commented 7 months ago

Allowing to read non-compliant EDF files was one of the main motivations in creating this package.

That's a pitty, I didn't know that. If I should have known, I wouldn't have bothered and created this issue.

However, you make a good point regarding awareness of non-compliant files and providing a way to fix them, ideally an easy one. In an internal discussion, we came up with the following strategy:

* continue to read all files that can reasonably be read

* issue warnings when accessing non-compliant fields that can be read (e.g., starttime separated by `:` instead of `.`, non-ascii label, lowercase month in EDF+ dates)

* add a parameter `fix_errors` to `Edf.write()` (the name is up for discussion)

It will not make any difference. The whole idea is to create just enough "annoyance" to motivate people to start to brake balls at the source (the person responsible for the non-compliant files). Your proposal will not create annoyance, at least not enough for people to at least try to do something about it. If you don't force people to interrupt their workflow and offer them a separate/manual tool to fix it, they will simply ignore it and continue with their work because time is money and why should they bother to brake balls at the source if they have a working solution at hand.

hofaflo commented 7 months ago

By default, the reader function should issue a warning for every non-standard field that it can read. However, I'd appreciate a parameter to turn these warnings off.

Good idea! We might prefer having them turned off by default, but are open for discussion (optimally in the upcoming PR).

So would it make sense to provide an autofix option in the reader function?

I thought about that again and now think that a separate method for autofixing would be the best option. That would avoid additional checks on read and make it most flexible to apply only on selected recordings.

I assume the current behavior of Edf.write() is to always error if anything is non-compliant, right? Instead of an autofix option at write, would it not make more sense to enforce standard compliance already when setting the various fields?

All setters are already strict, so any newly instantiated Edf object or modified header field is compliant. However, Edf.write currently performs only minimal checks regarding signal length and data record duration. The intention was to allow modifying and writing non-compliant files without having to fix everything that is broken. But with the prospect of an autofix method, being strict about compliance in Edf.write should be tolerable.

Allowing to read non-compliant EDF files was one of the main motivations in creating this package.

That's a pitty, I didn't know that.

Maybe we can be more explicit about that in the readme (currently it says "Fail late on read: Non-compliant header fields only raise an exception when the corresponding property is accessed.").

If you don't force people to interrupt their workflow and offer them a separate/manual tool to fix it, they will simply ignore it [...]

I get your point. Still, we want to prioritize ease of use and minimize manual intervention.

marcoross commented 7 months ago

I thought about that again and now think that a separate method for autofixing would be the best option. That would avoid additional checks on read and make it most flexible to apply only on selected recordings.

I agree. This could actually be combined with another method that allows for checking the EDF file for compliance. This way a developer could implement a workflow that checks if there are non-compliant fields in the EDF, informs the user about the issues, and then fixes them if possible.

All setters are already strict, so any newly instantiated Edf object or modified header field is compliant. However, Edf.write currently performs only minimal checks regarding signal length and data record duration. The intention was to allow modifying and writing non-compliant files without having to fix everything that is broken. But with the prospect of an autofix method, being strict about compliance in Edf.write should be tolerable.

That's true. So the with the current behavior it is possible to read invalid files, make some minor modifications, and then write the file, keeping the invalid fields. I think we should definitely change this (as was suggested anyway). I'd like to be able to claim that the library will never write invalid EDF files (even if they are based on an invalid EDF files that served as input).

adammj commented 1 month ago

I say, please continue following Postel's law ("be conservative in what you do, be liberal in what you accept from others"). 😀

I just discovered this package (but need to file an issue separately), and it has no problem ingesting the common issues that I've found some python EDF libraries will fail on.

Echoing what was said above, tons of files come from sources that we have no influence or control over (and often were collected years or decades ago)—so there is no one to complain to. And modifying all of them to "fix" obvious things is a non-starter, as then you're working with different files than anyone else taking from the same data repository. This is a huge issue I've seen across various labs, mutating the raw input files that should be immutable. Don't change your raw source data!