marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License
186 stars 27 forks source link

Mutating Annotation data / JamsFrames #141

Closed ejhumphrey closed 7 years ago

ejhumphrey commented 7 years ago

Before making sawdust and creating a PR, I wanted to discuss the (my?) problem, and how I plan to address it. In the following, I use ann to refer to an instance of an Annotation, and obs to refer to an instance of Observation.

The situation I'm facing:

A few observations:

My proposal:

  1. Either (a) make an Annotation object directly iterable (with an __iter__ & __next__), such that it returns references to Observation objects like above, or (b) add an explicit ann.observations(copy=False) method that allows for one to iterate over the content in an Annotation.
  2. Leave the internal (private) data structure of observations as a pure-Python object, i.e. a list.
  3. Make data an on-demand view of ann._observations rather than an attribute, achieved by decorating data as a @property. We could also add a setter, which would coerce DataFrame-like objects to be JamsFrames (via from_dataframe), helping to avoid user error.

thoughts?

update: I see now there are two options: update a dataframe and reassign, or step through the iterrows of the JamsFrame and create a new annotation, like in Annotation.trim.

ejhumphrey commented 7 years ago

After continuing to study our current implementation, I realize this is no minor surgery, and have found a work-around for the time being. That said, I would still lean on this for consideration / discussion.

ejhumphrey commented 7 years ago

maybe it's not a bug, but type(ann.data) != type(ann.data[:3]) is certainly surprising.

ejhumphrey commented 7 years ago

also I totally forgot that the built-in version of data is a pandas-style dict ({col0: values, col1:values, ...}) and not a list of observations since the JamsFrame happened...

bmcfee commented 7 years ago

It's not obvious to a user that ann.data is a subclassed JamsFrame and not a DataFrame. However, direct reassignment is currently allowed (ann.data = ann.data[:20]), which changes the object type (JamsFrame -> DataFrame) and breaking things later.

Direct reassignment is always allowed, unless we implement property guards on everything. I'm not necessarily opposed to doing that, but it does seem overkill.

My proposal:

Either (a) make an Annotation object directly iterable (with an iter & next), such that it returns references to Observation objects like above, or (b) add an explicit ann.observations(copy=False) method that allows for one to iterate over the content in an Annotation.

This would violate dataframe semantics, which already define iteration as column-wise. I don't think we should do this.

Leave the internal (private) data structure of observations as a pure-Python object, i.e. a list.

This would lose a lot of power in terms of slicing and alignment.

I never much liked the JamsFrame idea, but it was the only way I could see to make a dataframe with a pre-determined column set. OTOH, if we implement data as a property, the setter method could handle checking and validation, so maybe that's a better approach.

Make data an on-demand view of ann._observations rather than an attribute, achieved by decorating data as a @property. We could also add a setter, which would coerce DataFrame-like objects to be JamsFrames (via from_dataframe), helping to avoid user error.

I'd be okay with this.

bmcfee commented 7 years ago

My general advice here is to treat annotations as immutable unless you have a real strong reason not to. For the most part, annotations are small, so copy-edit is cheap.

ejhumphrey commented 7 years ago

I'll be honest, I failed to appreciate the true depth of the situation, and yea, there are some major architectural considerations here. The Right Answer™️ is the one that only extends the current interface (because I agree it is powerful quick for many things).

Like I said, I am unblocked, but will continue mulling this over and look forward to chatting about this in the future (Rochester?).

ejhumphrey commented 7 years ago

just for completeness ... this is what I came up with while avoiding pandas and it's obtuse view/copy rules. Any suggestions for simplifying this?

def fix_labfile(fname, tag_map, namespace="tag_open"):
    jam, ann = jams.util.import_lab(namespace, fname)
    ann_fixed = jams.Annotation(
        ann.namespace,
        data=None,
        annotation_metadata=ann.annotation_metadata,
        sandbox=ann.sandbox,
        time=ann.time,
        duration=ann.duration)

    for idx, obs in ann.data.iterrows():
        ann_fixed.append(time=obs['time'].total_seconds(),
                         duration=obs['duration'].total_seconds(),
                         value=tag_map.get(obs['value'], 'unknown'),
                         confidence=obs['confidence'])
    jam.annotations[0] = ann_fixed
    return jam, ann_fixed

I would prefer this, but can't get it to fly without complaining...

def fix_labfile(fname, tag_map, namespace="tag_open"):
    jam, ann = jams.util.import_lab(namespace, fname)
    ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
    return jam, ann
bmcfee commented 7 years ago

What about

ann.data['value'] = ann.data['value'].apply(lambda x: tag_map.get(x, 'unknown'))
ejhumphrey commented 7 years ago

ack, my apologies, I copied the wrong thing (trying to strip down to the shareable parts)

so this is okay...

def fix_labfile(fname, tag_map, keep_label, namespace="tag_open"):
    jam, ann = jams.util.import_lab(namespace, fname)
    ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
    ann.data.from_dataframe(ann.data)
    return jam, ann

but this is not...?

def fix_labfile(fname, tag_map, namespace="tag_open"):
    jam, ann = jams.util.import_lab(namespace, fname)
    ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
    data = ann.data.loc[ann.data.value == 'vocals']
    ann.data.from_dataframe(data)
    return jam, ann

throwing ...

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

which makes no sense to me...?

ejhumphrey commented 7 years ago

oh and to be clear, this line doesn't cause the warning, which is what I thought the problem is / was:

data = ann.data.loc[ann.data.value == 'vocals']

rather, it's the one after

ann.data.from_dataframe(data)

...

bmcfee commented 7 years ago

In light of #150 and #149 , I think we close this issue out and proclaim that observations are immutable objects. If we're dropping pandas backing, much of the discussion in this thread becomes moot.

bmcfee commented 7 years ago

@ejhumphrey is this one now resolved, or is there anything else you'd like added?

bmcfee commented 7 years ago

Had another thought on this one today: should we make Annotation.data a property? That way, we can have a setter that protects the data field type as SortedListWithKey, but otherwise populates the array directly.

bmcfee commented 7 years ago

[ping @ejhumphrey ]

ejhumphrey commented 7 years ago

eh?

ya, i think we cool. waving goodbye to pandas definitely fixes this ... I have seen some hiccups bringing some old code into the future, but that's not related to this (and probably my own fault).

bmcfee commented 7 years ago

Alright. I'll close this one out then, but reopen it if something specific to data mutation crops up.