mne-tools / mne-bids

MNE-BIDS is a Python package that allows you to read and write BIDS-compatible datasets with the help of MNE-Python.
https://mne.tools/mne-bids/
BSD 3-Clause "New" or "Revised" License

API update for tsv files #634

Open adam2392 opened 3 years ago

adam2392 commented 3 years ago

Describe the problem

With https://github.com/mne-tools/mne-bids/pull/601, an API for "updating" BIDS datasets was introduced in mne-bids. However, this currently only works for JSON files, which are easier to handle because their contents can be represented as a dictionary and modified via dict.update().

I have many datasets, though, where additional information is appended at the channel level, which is encoded in the channels.tsv and electrodes.tsv files. It would be awesome to have a similarly structured API for updating the corresponding .tsv files inside a BIDS dataset.
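
For reference, here is roughly how the JSON-based update from #601 is used today (a sketch only; the parameter names are from memory and may differ slightly from the released API):

from mne_bids import BIDSPath, update_sidecar_json

# Point the BIDSPath at the sidecar JSON itself (suffix + .json extension).
bids_path = BIDSPath(subject='01', session='01', task='rest', datatype='eeg',
                     suffix='eeg', extension='.json', root='/path/to/bids_root')

# The entries behave like a dict.update() on the sidecar contents.
entries = {'PowerLineFrequency': 60, 'EEGReference': 'Cz'}
update_sidecar_json(bids_path, entries)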

Describe your solution

I don't have one heh, but was hoping the brilliant minds here would have an idea on how to update a tsv file robustly and efficiently.

Describe possible alternatives

A naive way of solving the above problem would be to just pass a nested dictionary as the update. That is:

an update dictionary for a channels.tsv file would look like:

{
    'C1': {
        'status_description': 'blah',
    },
    'C2': {
        'type': 'Electroencephalography',
    },
}

Then the corresponding status_description and type are updated for channel rows "C1" and "C2", and everything else is left as is.
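
To illustrate how such a nested-dict update could be applied internally, here is a small sketch using pandas and a hypothetical helper (not an existing mne-bids function), assuming the 'name' column serves as the row index:

import pandas as pd

def _update_tsv_from_dict(tsv_path, updates):
    """Hypothetical helper: apply a nested {row: {column: value}} dict to a TSV."""
    df = pd.read_csv(tsv_path, sep='\t')
    df = df.set_index('name')  # assume the channel name identifies each row
    for row, cols in updates.items():
        for col, value in cols.items():
            df.at[row, col] = value  # all other cells are left untouched
    df.reset_index().to_csv(tsv_path, sep='\t', index=False, na_rep='n/a')

updates = {
    'C1': {'status_description': 'blah'},
    'C2': {'type': 'Electroencephalography'},
}
_update_tsv_from_dict('sub-01_ses-01_task-rest_channels.tsv', updates)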

Additional context

hoechenberger commented 3 years ago

My take would be a list of tuples:

new_vals = [('row_val_1', 'col_val_1', val_1),
            ('row_val_n', 'col_val_n', val_n)]

To select all rows or columns, we could use None:

new_vals = [(None, 'col_val_1', val_1),  # all rows
            ('row_val_n', None, val_n)]  # all cols

We need a way to specify which column shall serve as the index, so specifying a row value for indexing actually makes sense.

For example, to select all bad channels and change their status_description, we'd do:

new_vals = [('status', 'bad', 'status_description', 'super bad')]

but one can already see how this is going to be horrible to use.

I believe what we actually want is pandas-style indexing (i.e., DataFrame.loc)
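
For comparison, the same "change status_description of all bad channels" update written with DataFrame.loc (only a sketch, assuming the channels.tsv has already been read into a DataFrame):

import pandas as pd

df = pd.read_csv('sub-01_ses-01_task-rest_channels.tsv', sep='\t')

# Select every row whose status is 'bad' and overwrite its status_description.
df.loc[df['status'] == 'bad', 'status_description'] = 'super bad'

df.to_csv('sub-01_ses-01_task-rest_channels.tsv', sep='\t', index=False, na_rep='n/a')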

agramfort commented 3 years ago

let's not reinvent the wheel here.

if we have a way to read the events as a DataFrame, we can let users do what they want, and then we offer them a way to write it back while doing some schema checks.

my 2c

hoechenberger commented 3 years ago

Totally agree. I was just trying to think within the existing framework of the JSON updates.

But I would very much prefer using pandas here too.

We should not expose these implementation details to the users though, so let's discuss API :) I don't want to end up with one update function for JSON and another for TSV files, that's all

adam2392 commented 3 years ago

let's not reinvent the wheel here. if we have a way to read the events as a DataFrame, we can let users do what they want, and then we offer them a way to write it back while doing some schema checks. my 2c

Agreed, internally we should definitely use a pandas DataFrame. The API would just be for consistency's sake, since I can see a lot of use for updating a list of channels with some updated column value in channels.tsv/electrodes.tsv.

new_vals = [('row_val_1', 'col_val_1', val_1),
            ('row_val_n', 'col_val_n', val_n)]

To select all rows or columns, we could use None:

new_vals = [(None, 'col_val_1', val_1),  # all rows 
            ('row_val_n', None, val_n)]  # all cols

I like this approach.

We need a way to specify which column shall serve as the index, so specifying a row value for indexing actually makes sense.

I think for files such as channels.tsv/electrodes.tsv we can safely assume that the index is the name column? That way, if you want to update the status and status_description, you would do:

new_vals = [(None, 'status', 'bad'), (None, 'status_description', 'super bad')]

agramfort commented 3 years ago

can you give me a concrete usage example for an event or channel tsv file?

hoechenberger commented 3 years ago

I'm also not yet totally convinced what exactly we're trying to achieve and how to get there :)

adam2392 commented 3 years ago

This would be a change to electrodes.tsv/channels.tsv. Say I need to change the description of a channel, or a set of channels, to resected, or say left temporal lobe:

new_vals = [
    ('C1', 'description', 'resected'),
    ('C2', 'description', 'resected'),
    ('C40', 'description', 'left temporal lobe'),
]

# the index column is assumed to be 'name', since this is a channels.tsv file
update_sidecar_tsv(bids_path, new_vals)

I acknowledge this can potentially be messy down the road, but minimally I feel like it would be useful to have a tsv reader/writer that might abstract away the pandas DataFrame reading?
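
A rough sketch of what such an update_sidecar_tsv could look like internally (purely hypothetical, not an existing mne-bids function), assuming (row, column, value) triplets and the 'name' column as the index:

import pandas as pd
from mne_bids import BIDSPath

def update_sidecar_tsv(bids_path, new_vals, index_col='name'):
    """Hypothetical: apply (row, column, value) triplets to a channels.tsv."""
    tsv_path = bids_path.copy().update(suffix='channels', extension='.tsv').fpath
    df = pd.read_csv(tsv_path, sep='\t')
    df = df.set_index(index_col)
    for row, col, value in new_vals:
        df.at[row, col] = value
    df.reset_index().to_csv(tsv_path, sep='\t', index=False, na_rep='n/a')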

hoechenberger commented 3 years ago

but minimally I feel like it would be useful to have a tsv reader/writer that might abstract away the pandas DataFrame reading?

I see the biggest advantage in the fact that a BIDSPath can essentially match more than just one TSV file (an entire study), which could be useful

jasmainak commented 3 years ago

to have a tsv reader/writer that might abstract away the pandas DataFrame reading?

is it more than 1 line of code to read it with pandas? We will add a dependency with no benefit. Can't you achieve the same with BIDSPath.match() and pd.read_csv(...)?
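
For what it's worth, that workaround is indeed only a handful of lines (a sketch; the exact BIDSPath arguments here are assumptions):

import pandas as pd
from mne_bids import BIDSPath

# Match every channels.tsv in the dataset and edit each one in place.
matches = BIDSPath(root='/path/to/bids_root', suffix='channels',
                   extension='.tsv', check=False).match()
for path in matches:
    df = pd.read_csv(path.fpath, sep='\t')
    df.loc[df['name'] == 'C1', 'status_description'] = 'resected'
    df.to_csv(path.fpath, sep='\t', index=False, na_rep='n/a')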

hoechenberger commented 3 years ago

is it more than 1 line of code to read it with pandas? We will add a dependency with no benefit. Can't you achieve the same with BIDSPath.match() and pd.read_csv(...)?

Yes, and it'd be more flexible too, I suppose.

But this raises the question of why we offer JSON sidecar updating and not TSV sidecar updating.

agramfort commented 3 years ago

cause pandas does exist for TSV, but not for JSON :)

what you describe is:

df = read_sidecar_tsv(bids_path)

new_vals = [('C1', 'description', 'resected'),
            ('C2', 'description', 'resected'),
            ('C40', 'description', 'left temporal lobe')]

for val in new_vals:
    df.at[val[0], val[1]] = val[2]

if we stick to such updates then yes maybe pandas is not necessary

and this could be done for all matching files.

hoechenberger commented 3 years ago

cause pandas does exist for TSV, but not for JSON :)

Well.

# %%
import json_tricks
import pandas as pd

x = {
    'foo': 'bar',
    'baz': {
         'xxx': 1,
         'yyy': False
    }
}

x_json = json_tricks.dump(x, '/tmp/foo.json')

df = pd.read_json('/tmp/foo.json')
print(df)
     foo  baz
xxx  bar    1
yyy  bar    0

agramfort commented 3 years ago

:)

can you try on a json as specified by bids?

more seriously let's stick to TSV here