Closed ejhumphrey closed 7 years ago
After continuing to study our current implementation, I realize this is no minor surgery, and have found a work-around for the time being. That said, I would still lean on this for consideration / discussion.
maybe it's not a bug, but type(ann.data) != type(ann.data[:3])
is certainly surprising.
also I totally forgot that the built-in version of data
is a pandas-style dict ({col0: values, col1:values, ...}
) and not a list of observations since the JamsFrame happened...
It's not obvious to a user that ann.data is a subclassed JamsFrame and not a DataFrame. However, direct reassignment is currently allowed (ann.data = ann.data[:20]), which changes the object type (JamsFrame -> DataFrame) and breaking things later.
Direct reassignment is always allowed, unless we implement property guards on everything. I'm not necessarily opposed to doing that, but it does seem overkill.
My proposal:
Either (a) make an Annotation object directly iterable (with an iter & next), such that it returns references to Observation objects like above, or (b) add an explicit ann.observations(copy=False) method that allows for one to iterate over the content in an Annotation.
This would violate dataframe semantics, which already define iteration as column-wise. I don't think we should do this.
Leave the internal (private) data structure of observations as a pure-Python object, i.e. a list.
This would lose a lot of power in terms of slicing and alignment.
I never much liked the JamsFrame
idea, but it was the only way I could see to make a dataframe with a pre-determined column set. OTOH, if we implement data
as a property, the setter method could handle checking and validation, so maybe that's a better approach.
Make data an on-demand view of ann._observations rather than an attribute, achieved by decorating data as a @property. We could also add a setter, which would coerce DataFrame-like objects to be JamsFrames (via from_dataframe), helping to avoid user error.
I'd be okay with this.
My general advice here is to treat annotations as immutable unless you have a real strong reason not to. For the most part, annotations are small, so copy-edit is cheap.
I'll be honest, I failed to appreciate the true depth of the situation, and yea, there are some major architectural considerations here. The Right Answer™️ is the one that only extends the current interface (because I agree it is powerful quick for many things).
Like I said, I am unblocked, but will continue mulling this over and look forward to chatting about this in the future (Rochester?).
just for completeness ... this is what I came up with while avoiding pandas and it's obtuse view/copy rules. Any suggestions for simplifying this?
def fix_labfile(fname, tag_map, namespace="tag_open"):
jam, ann = jams.util.import_lab(namespace, fname)
ann_fixed = jams.Annotation(
ann.namespace,
data=None,
annotation_metadata=ann.annotation_metadata,
sandbox=ann.sandbox,
time=ann.time,
duration=ann.duration)
for idx, obs in ann.data.iterrows():
ann_fixed.append(time=obs['time'].total_seconds(),
duration=obs['duration'].total_seconds(),
value=tag_map.get(obs['value'], 'unknown'),
confidence=obs['confidence'])
jam.annotations[0] = ann_fixed
return jam, ann_fixed
I would prefer this, but can't get it to fly without complaining...
def fix_labfile(fname, tag_map, namespace="tag_open"):
jam, ann = jams.util.import_lab(namespace, fname)
ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
return jam, ann
What about
ann.data['value'] = ann.data['value'].apply(lambda x: tag_map.get(x, 'unknown'))
ack, my apologies, I copied the wrong thing (trying to strip down to the shareable parts)
so this is okay...
def fix_labfile(fname, tag_map, keep_label, namespace="tag_open"):
jam, ann = jams.util.import_lab(namespace, fname)
ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
ann.data.from_dataframe(ann.data)
return jam, ann
but this is not...?
def fix_labfile(fname, tag_map, namespace="tag_open"):
jam, ann = jams.util.import_lab(namespace, fname)
ann.data['value'] = [tag_map.get(tag, 'unknown') for tag in ann.data.value]
data = ann.data.loc[ann.data.value == 'vocals']
ann.data.from_dataframe(data)
return jam, ann
throwing ...
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
which makes no sense to me...?
oh and to be clear, this line doesn't cause the warning, which is what I thought the problem is / was:
data = ann.data.loc[ann.data.value == 'vocals']
rather, it's the one after
ann.data.from_dataframe(data)
...
In light of #150 and #149 , I think we close this issue out and proclaim that observations are immutable objects. If we're dropping pandas backing, much of the discussion in this thread becomes moot.
@ejhumphrey is this one now resolved, or is there anything else you'd like added?
Had another thought on this one today: should we make Annotation.data
a property? That way, we can have a setter that protects the data
field type as SortedListWithKey
, but otherwise populates the array directly.
[ping @ejhumphrey ]
eh?
ya, i think we cool. waving goodbye to pandas definitely fixes this ... I have seen some hiccups bringing some old code into the future, but that's not related to this (and probably my own fault).
Alright. I'll close this one out then, but reopen it if something specific to data mutation crops up.
Before making sawdust and creating a PR, I wanted to discuss the (my?) problem, and how I plan to address it. In the following, I use
ann
to refer to an instance of anAnnotation
, andobs
to refer to an instance ofObservation
.The situation I'm facing:
ann.data.from_dataframe
.A few observations:
[fx(obs) for obs in ann]
data
object is persistent and owned by the annotation, which makes mutating it directly seem like an option.ann.data
is a subclassedJamsFrame
and not aDataFrame
. However, direct reassignment is currently allowed (ann.data = ann.data[:20]
), which changes the object type (JamsFrame -> DataFrame
) and breaking things later.My proposal:
Annotation
object directly iterable (with an__iter__
&__next__
), such that it returns references toObservation
objects like above, or (b) add an explicitann.observations(copy=False)
method that allows for one to iterate over the content in anAnnotation
.data
an on-demand view ofann._observations
rather than an attribute, achieved by decoratingdata
as a@property
. We could also add a setter, which would coerceDataFrame
-like objects to be JamsFrames (viafrom_dataframe
), helping to avoid user error.thoughts?
update: I see now there are two options: update a dataframe and reassign, or step through the iterrows of the JamsFrame and create a new annotation, like in Annotation.trim.