mne-tools / mne-bids

MNE-BIDS is a Python package that allows you to read and write BIDS-compatible datasets with the help of MNE-Python.
https://mne.tools/mne-bids/
BSD 3-Clause "New" or "Revised" License
131 stars 85 forks source link

Do not zero-pad indices in BIDSPath entities anymore; do not warn about prepending missing dots for extensions #1215

Closed alexrockhill closed 6 months ago

alexrockhill commented 8 months ago

PR Description

I often have subject ids like 1 and 2 and I don't always store them as strings. I think it would be nice if mne_bids just handled this to allow for integer subject ids, runs, sessions, acquisitions etc.

Merge checklist

Maintainer, please confirm the following before merging. If applicable:

alexrockhill commented 8 months ago

Hmmm it looks like this was by design that integers were excluded, there's even a test (failing in this case) that subject=1 fails. I find this a bit annoying because I don't understand the need for IDs like "01" on modern computing systems, it seems like a design choice that some but not all people make that can lead to Y2K phenomena if for instance you end up collecting more than 100 or 1000 subjects when you started with a "00" or "000" system. Is there a rationale for blocking users from doing this? To me it seems overly pedantic.

alexrockhill commented 8 months ago

image

Given that it's the "01" way on the homepage of BIDS, I guess this is fairly used but I don't recall seeing an explanation of why you need to format them this way. With the key-value tags in BIDS, this seems superfluous and from an era when strings were hardcoded to a certain length which is a brittle design choice for long identifiers..

alexrockhill commented 8 months ago

Also if you pass run=1 it automatically gets mapped to "01" which seems very odd and unexpected.

hoechenberger commented 8 months ago

I agree:

alexrockhill commented 8 months ago

Nice, okay, I thought I was alone in thinking this was a bit silly and that everyone else labeled their subjects this way for a minute. If the consensus is that this is that the labels should be less constrained, I'm happy to add that to the PR.

hoechenberger commented 8 months ago

@alexrockhill We better double-check check the BIDS specs before we proceed here Otherwise I'm +1

alexrockhill commented 8 months ago

https://github.com/bids-standard/bids-specification/discussions/1679

Remi-Gau commented 8 months ago

Note that in BIDS runs are indices (i.e must be a digit) and subjects, sessions (and most other entities) are labels and can be alphanumeric.

https://bids-specification.readthedocs.io/en/latest/appendices/entities.html#sub

https://bids-specification.readthedocs.io/en/latest/appendices/entities.html#run

This partly explains why you can assume:

run-01 --> 1

but not

sub-01 --> 1 because you could also have sub-1

Hope this helps a bit.

jasmainak commented 8 months ago

I think @Remi-Gau is on the right track. It's why we disabled integers so that users are explicit and don't mix sub-01 and sub-1

hoechenberger commented 8 months ago

@alexrockhill I think if you pass strings, no alterations will take place. So this would be the easiest workaround I suppose?

alexrockhill commented 8 months ago

I think @Remi-Gau is on the right track. It's why we disabled integers so that users are explicit and don't mix sub-01 and sub-1

I understand the principle but I don't understand the use case. Who labels their subjects 1 and 01?

My preference would be to allow integers and cast them to strings since subject=1 and subject="1" seem like reasonable subject identifiers to me and would mean the same subject. Since it's just annoying in code to cast everything e.g. subject=str(subject) and so it seems convenient to do this within mne_bids to make it as usable as possible. In general, I will say the annoying thing about BIDS is that it can sometimes feel like death by a million paper cuts where when you try and use it you get caught up on a bunch of little things like this and each one alone is rather silly to complain about but the end result is that people think it's more of a pain than it is. For example, in this context, figuring out if each tag is an index or a label (i.e. accepts int or string), having to remember if extension=".edf" is right or extension="edf" is right or having to use extension and suffix even though if you pass datatype to read_raw_bids so the file type is able to be inferred (I know internally it's this way because BIDSPath isn't just used for read_raw_bids but also modified to find sidecars but that doesn't really matter from an end-user perspective). Anyway, the point being, this is one of several pain points that I think, especially for write_raw_bids and read_raw_bids is pretty mission-critical for adoption so that people don't run into an error that isn't one that is essential to ensuring that the dataset contains all the necessary information and give up. If something is only recommended by BIDS, I don't think you should throw an error in mne_bids. A warning maybe but an error is overly pedantic in my opinion.

The run one is just a bug, that seems pretty simple. You should either not be allowed to pass integers or run=1 should be cast to run="1" but definitely not run="01".

alexrockhill commented 8 months ago

I'd actually go further if it was entirely my design choice and I would match subject=1 to subject="01" and only throw an error at run time if there are both subject="01" and subject="1". To me, it seems way more practical and common of a use case to have incrementing subject indices and I think you'd actually want an error when you try and make subject="1" and subject="01" that seems like a situation that should not be allowed, it's too confusing, so actually it would be better to throw the error during writing.

That would also mean run="01" or run="1" or run="1" would be irrelevant, they would be detected as incrementing indices and mapped to integers and tested for uniqueness. A bit of a reiteration, but I really think run="01" and run="1' both being allowed does not make a lot of sense; different experimenters might have these different opinions on zero filling and enter them differently in a free fill entry box and cause issues which seems like a very real use case.

Lastly, I'd add we could store them all per BIDS recommendations with the two digit zero fill which I wouldn't mind at all. i.e. If you pass subject=1 it gets stored as subject="01" but then when you go to read with subject=1 you still get your subject. I think that's actually very maintainable since there are two cases 1) alpha-numeric and 2) numeric only so in 1) you just leave it and in 2) everything gets cast to two digit zero padded. I really like that solution.

The only issue is that it wouldn't work for datasets with subject="01" and subject="1" but I think that's such an implausible case that as a user, I'd like an error for a dataset like that so I can check what is going on.

alexrockhill commented 8 months ago

I went to see how it would be implementing and saw this

warn(
    f'extension should start with a period ".", but got: '
    f'"{extension}". Prepending "." to form: ".{extension}". '
    f"This will raise an exception starting with MNE-BIDS 0.12.",
    category=FutureWarning,
)
kwargs["extension"] = f".{extension}"
# Uncomment in 0.12, and remove above code:
#
# raise ValueError(
#     f'Extension must start wie a period ".", but got: '
#     f'{extension}'
# )

To me, this is crazy. Why are you raising an error on a user who passes extension="edf"? This almost seems antagonistic to a well-intentioned person who wants to use BIDS. I don't mean to be overly critical or criticize design choices on a personal level, mne_bids is great and a lot of other people have put some fantastic work into this, there's just a few choices like this that I really don't understand from the perspective of the basic approach. I thought the ethos of BIDS was to standardize the essential things and then encourage wide adoption by being flexible otherwise and that ethos would carry down to generally trying to be as not pedantic as possible.

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (87eea28) 97.61% compared to head (6c48c6d) 97.60%. Report is 5 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1215 +/- ## ========================================== - Coverage 97.61% 97.60% -0.02% ========================================== Files 40 40 Lines 8685 8682 -3 ========================================== - Hits 8478 8474 -4 - Misses 207 208 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

alexrockhill commented 8 months ago

Just wanted to add, I'm not trying to force this PR through, this is just my vision and addresses what I perceive to be pain points in the use case. Happy to discuss or close if people don't want to discuss and to just keep things how they are.

hoechenberger commented 8 months ago

I agree our API is hard to use.

jasmainak commented 8 months ago

I was just trying to explain what might have been a possible reason. It's important not to cause a regression bug. I'll let you figure out the best path forward with @hoechenberger !

agramfort commented 8 months ago

Allowing different ways to be valid means adding more magic to the code. You need to multiply test to support 1 or str 01 and also tests of if both are subjects etc. You create more corner cases in the code and complexity is exponential with the number of options. Any bad default you point out was maybe a natural default for someone with a different prior and this person did not complain here. I am not trying to force the status quo here but be mindful of these arguments.

alexrockhill commented 8 months ago

Totally, I asked why it was that way so the explanation is very helpful!

I think this would be backward compatible except if someone upgrades in the middle of converting a dataset which they aren't supposed to do.

It will silently return subject="01" if you have a dataset with subject="01" and subject="1" which I think is reasonable, again I hope this isn't a use case. But otherwise, this PR just takes integers and casts them to the BIDS recommendation which is the best of both worlds, I think (i.e. I just say subject=1 but that gets translated to subject="01" per BIDS recommendation). This obviates the need for the user to even think about zero padding which I would argue they probably don't want to and just want things to work and give readable results.

alexrockhill commented 8 months ago

@agramfort agreed but there are only two cases so it seems manageable.

And we already do this for run which causes unexpected results but if it were used everywhere it would be expected.

hoechenberger commented 7 months ago

I'm not decided whether I want to change anything :) I just wanted to say that I think our API is not very easy to use :)

sappelhoff commented 7 months ago

I agree that it's important to keep in mind what @agramfort said here: https://github.com/mne-tools/mne-bids/pull/1215#issuecomment-1887950194

If it were just me, I'd maybe improve the documentation about this, but I wouldn't start adding more "magic conversion" behind the scenes. The present behavior has been in place for a very long time and we rarely get issues about it (we did however, admittedly, get one about it not too long ago: https://mne.discourse.group/t/remove-automatic-zero-padding-when-specifing-run-in-bidspath/7865/4?u=sappelhoff).

hoechenberger commented 7 months ago

Also we kept on removing "magic" over the past years, because it would cause all sorts of problems

alexrockhill commented 7 months ago

My point was just that it makes sense to be consistent and have this magic everywhere (subject as well as run) if we want to enforce the recommendation or to have it nowhere.

Again, per the discourse use case comment/issue, people try and pass ints and by not handling it consistently, we're making them dive into these kinds of discussions zero padding etc. that are going to make people think BIDS is too much of a work-in-progress for them to use currently.

So, if it's about removing magic, then maybe just a straight cast to string for integers with no modification.

I think people are going to pass integers though and throwing them an error doesn't seem helpful since the information is fundamentally there.

EDIT: I think no zero-padding is a totally reasonable solution and we can adjust print_dir_tree etc. as well to order subjects, runs etc. more sensibly with s = int(s) if s.isdigit() else s.

alexrockhill commented 7 months ago

So now all the magic is removed from run and split as well which solves the discourse issue but will be slightly backward incompatible because if someone has run=1 in their code, they'll have to change it to run="01" to access an already-converted dataset.

To do (in follow-up PRs):

alexrockhill commented 7 months ago

I'm getting a bunch of pydantic import errors when I went to test mne_bids_pipeline...

Traceback (most recent call last):
  File "/home/alexrockhill/software/anaconda3/envs/mne/bin/pytest", line 8, in <module>
    sys.exit(console_main())
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 190, in console_main
    code = main()
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 148, in main
    config = _prepareconfig(args, plugins)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 329, in _prepareconfig
    config = pluginmanager.hook.pytest_cmdline_parse(
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/helpconfig.py", line 103, in pytest_cmdline_parse
    config: Config = outcome.get_result()
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1060, in pytest_cmdline_parse
    self.parse(args)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1348, in parse
    self._preparse(args, addopts=addopts)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1231, in _preparse
    self.pluginmanager.load_setuptools_entrypoints("pytest11")
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pluggy/_manager.py", line 287, in load_setuptools_entrypoints
    plugin = ep.load()
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/napari/utils/__init__.py", line 2, in <module>
    from napari.utils.colormaps import Colormap
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/napari/utils/colormaps/__init__.py", line 2, in <module>
    from napari.utils.colormaps.colormap import Colormap
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/napari/utils/colormaps/colormap.py", line 9, in <module>
    from napari.utils.events import EventedModel
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/napari/utils/events/__init__.py", line 17, in <module>
    from napari.utils.events.evented_model import EventedModel
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/napari/utils/events/evented_model.py", line 66, in <module>
    class EventedMetaclass(main.ModelMetaclass):
  File "/home/alexrockhill/software/anaconda3/envs/mne/lib/python3.10/site-packages/pydantic/_migration.py", line 306, in wrapper
    raise AttributeError(f'module {module!r} has no attribute {name!r}')
AttributeError: module 'pydantic.main' has no attribute 'ModelMetaclass'
alexrockhill commented 7 months ago

Could someone with a working version of mne-bids-pipeline test this and confirm that it doesn't break anything please?

larsoner commented 7 months ago

An easy way if you don't want to set it up yourself would be to open a PR to mne-bids-pipeline where you change pyproject.toml to tell it to install your branch of mne-bids:

https://github.com/mne-tools/mne-bids-pipeline/blob/main/pyproject.toml#L49

Something like mne-bids[full] @ git+https://github.com/alexrockhill/mne-bids@int rather than just mne-bids[full] should do it.

alexrockhill commented 7 months ago

Looks like it doesn't break anything in mne-bids-pipeline so looks good to go possibly

alexrockhill commented 6 months ago

LGTM?

hoechenberger commented 6 months ago

Unfortunately I currently don't have the time to really think through the implications of this change, so I'd like to withhold my vote for now

alexrockhill commented 6 months ago

Oh I was under the impression we talked it through and came to an intermediate compromise everyone was happy with. I.e. remove magic zero padding for all fields (it was on run before) and allow int for all fields but just cast it to string. That includes no magic and changes current behavior in a small way: i.e. run=1 just goes to run='1' not run='01'.

alexrockhill commented 6 months ago

Also worth noting, I'm really not opinionated as to whether subject/run=1 goes to subject/run="01" or subject/run='1', I think both are fine but I am opinionated that they be the same. Also, I made this PR because I find mne_bids quite annoying that it doesn't accept subject=1 (and extension='edf' not being allowed is also very annoying).

sappelhoff commented 6 months ago

I'll take some time and think through this within the next days, sorry for causing some frustration here.

alexrockhill commented 6 months ago

No worries, take your time, not frustrated by you, just a bit annoyed by how mne_bids works but that's what PRs are for!

alexrockhill commented 6 months ago

Thanks for getting back @sappelhoff , the changes look good and fix most of the pain points I was running into.

I understand the index/label distinction and the motivation between trying to have nice categories. However, in practice, I have observed that even for labels, people tend to use integers more than not (e.g. sub-1, ses-1). I feel like if people actually did use non-index labels, this wouldn't even be up for discussion because it would be very clear that of course you call your subject A, B, C etc. but that just isn't how the community has spontaneously organized, people picked indices.

I'll defer to your judgment though, thanks for discussing. Maybe it's an issue better taken upstream in the BIDS specification that that there even be this distinction in the first place since I feel it's a bit more aspirational than actually adopted, if I really care that much.

sappelhoff commented 6 months ago

However, in practice, I have observed that even for labels, people tend to use integers more than not

that's true, and that's also okay, because labels may be "alphanumeric" ... whereas indices are "non-negative integers".

(https://bids-specification.readthedocs.io/en/stable/common-principles.html#definitions)

You are right that people use sub-<label> as sub-01 often. But I have rarely seen task-<label> to be task-01 --> I think we don't really have a problem here. We could make a big PR and add the option to have int for each "label-entity", but I classify this as a high effort, low impact task.

Gonna merge now! :-)

Thanks for the discussion, I think the changes here improve the user experience!

hoechenberger commented 6 months ago

Thanks everyone!!

alexrockhill commented 6 months ago

Sure, I think that's a good place to leave the discussion for later but I'll just add a last comment that because task=1 makes no sense and is never used and sub=1 is the most common use case (note 1 not '01' from what I've seen from the community at large, BIDS core devs tend to use '01' which I think is a bit of a disconnect with the broader community) so it doesn't make sense to me that those are the same category although I agree alphanumeric is a fitting umbrella category, I wouldn't call sub a label in the way that task is since sub tends to be an index in practice. Which is to say it's a totally understandable and workable difference in opinions and maybe I'll just save this thread for when this viewpoint wins out in 5 years :)