marl / jams

A JSON Annotated Music Specification for Reproducible MIR Research
ISC License
185 stars 26 forks source link

Implement trim function #136

Closed justinsalamon closed 8 years ago

justinsalamon commented 8 years ago

This PR will implement a trim function for jams objects as discussed in #135


This change is Reviewable

justinsalamon commented 8 years ago

Looks like this is ready for a CR. @bmcfee (or anyone else)?

justinsalamon commented 8 years ago

@bmcfee are you back from crunchland yet?

bmcfee commented 8 years ago

@bmcfee are you back from crunchland yet?

nope; next week okay?

justinsalamon commented 8 years ago

nope; next week okay?

yeah sure, thanks!

justinsalamon commented 8 years ago

@bmcfee I've updated the API again based on our latest discussion, refactoring trim() into trim() and slice(). I've documented the new expected behavior of each function in detail here, which is probably worth reading before giving this PR a CR, which I think it's ready for again.

bmcfee commented 8 years ago

Review status: 0 of 2 files reviewed at latest revision, 16 unresolved discussions.


jams/core.py, line 47 at r2 (raw file):

import gzip
import datetime

why datetime?


jams/core.py, line 897 at r2 (raw file):

        of tuples to the annotation's sandbox keyed by `sandbox.trim` which
        documents each trim operation with a tuple `(start_time, end_time,
        trim_start, trim_end)`.

Please break this up into a short (one-line) description and then a more structured, detailed description.

Other notes:


jams/core.py, line 939 at r2 (raw file):


        # The annotation must have it's duration value set for trimming
        if self.duration is None:

Why is this an error? The semantics of duration=None are that the annotation implicitly spans the full track duration as specified in file_metadata.


jams/core.py, line 990 at r2 (raw file):

                                       confidence=obs['confidence'])

        if 'trim' not in ann_trimmed.sandbox.keys():

you don't need to call .keys() here


jams/core.py, line 995 at r2 (raw file):

        else:
            ann_trimmed.sandbox.trim.append(
                (start_time, end_time, trim_start, trim_end))

I think trim history would make more sense as a list of dictionaries, rather than a list of tuples.

Also, your annotation construction seems to be a copy-by-reference of sandbox, so I think this will also populate the trim field in the input annotation's sandbox. This is probably not what you intended.


jams/core.py, line 1015 at r2 (raw file):

        trimming operation will also be documented in `Annotation.sandbox.trim`
        as described in `Annotation.trim`.

Again, provide a short (1-sentence) docstring and then a detailed description.


jams/core.py, line 1023 at r2 (raw file):

        ----------
        start_time : float
            The desired start time for slicing.

+"in seconds"


jams/core.py, line 1039 at r2 (raw file):

        sliced_ann : Annotation
            The sliced annotation.

Add a "see also" to trim

Add an example


jams/core.py, line 1050 at r2 (raw file):

        adjustment = ref_time - sliced_ann.time

        if sliced_ann.data is not None:

why would this ever be None?


jams/core.py, line 1051 at r2 (raw file):


        if sliced_ann.data is not None:
            for idx in range(len(sliced_ann.data.index)):

This isn't the correct way to iterate over dataframe indices.

Instead, do

for idx in sliced_ann.data.index:

or avoid iteration altogether:

sliced_ann.data['time'] = sliced_ann.data['time'].apply(lambda x: x - pd.to_timedelta(adjustment, unit='s')

jams/core.py, line 1058 at r2 (raw file):

        slice_end = ref_time + sliced_ann.duration

        if 'slice' not in sliced_ann.sandbox.keys():

no need for .keys(), and same comments as in trim above.


jams/core.py, line 1561 at r2 (raw file):

            raise JamsError(
                'Duration must be set (jam.file_metadata.duration) before '
                'trimming can be performed.')

I don't see why this should be required here.


jams/core.py, line 1567 at r2 (raw file):

                start_time > float(self.file_metadata.duration) or
                end_time < start_time or
                end_time > float(self.file_metadata.duration)):

fun fact: python supports triple comparison, so you can do not (0 <= start_time <= self.file_metadata.duration)


jams/core.py, line 1576 at r2 (raw file):

        jam_trimmed = JAMS(annotations=None,
                           file_metadata=self.file_metadata,
                           sandbox=self.sandbox)

I think you want a deep copy here, not copy-by-reference.


jams/core.py, line 1583 at r2 (raw file):


        # Document jam-level trim in top level sandbox
        if 'trim' not in jam_trimmed.sandbox.keys():

-.keys()


_tests/jams_test.py, line 732 at r2 (raw file):_


def test_annotation_trim():

Please break this test into smaller test functions for each test case.


Comments from Reviewable

justinsalamon commented 8 years ago

Review status: 0 of 2 files reviewed at latest revision, 16 unresolved discussions.


jams/core.py, line 47 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > why datetime? >

Thought I needed it for some conversions but ended up using pandas instead. Removed.


_jams/core.py, line 897 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > Please break this up into a short (one-line) description and then a more structured, detailed description. > > Other notes: > - `self` is not an externally meaningful variable name for object methods. > - break out the warning and exception cases with bullets so they're easily identified >

Removed references to self, broken up docstring into more manageable paragraphs, and added examples.


jams/core.py, line 939 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > Why is this an error? The semantics of `duration=None` are that the annotation implicitly spans the full track duration as specified in `file_metadata`. >

I need to know the time range spanned by the annotation to compute the intersection with the trim range specified by the user, for which I need Annotation.duration.

Since an Annotation object can exist independently of a JAMS object, while duration=None means "use the jams duration", there's no way for me to access this value from inside this function (or the value may not exist).


_jams/core.py, line 990 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > you don't need to call `.keys()` here >

Tried without but it breaks (problem in sandbox implementation?), leaving in for now.


jams/core.py, line 995 at r2 (raw file): OK, I've switched the history to be a list of dicts.

Regarding the sandbox, I don't think it's copied by reference:

        # Create new annotation with same namespace/metadata
        ann_trimmed = Annotation(
            self.namespace,
            data=None,
            annotation_metadata=self.annotation_metadata,
            sandbox=self.sandbox,
            time=trim_start,
            duration=trim_end - trim_start)

If you try it out the sandbox of the original annotation is unchanged:

>>> ann = jams.Annotation(namespace='tag_open', time=2, duration=8)
>>> ann_trim = ann.trim(5, 8)
>>> print(ann_trim.sandbox)
{
  "trim": [
    {
      "start_time": 5, 
      "trim_start": 5, 
      "end_time": 8, 
      "trim_end": 8
    }
  ]
}
>>> print(ann.sandbox)
{}

jams/core.py, line 1015 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > Again, provide a short (1-sentence) docstring and then a detailed description. >

Fixed


jams/core.py, line 1023 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > +"in seconds" >

Added


jams/core.py, line 1039 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > Add a "see also" to trim > > Add an example >

Added see also and example.


_jams/core.py, line 1050 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > why would this ever be `None`? >

When I was playing around with this during initial development I seem to have reached this case, but I can't reproduce it, so removing the check for None.


[jams/core.py, line 1051 at r2](https://reviewable.io:443/reviews/marl/jams/136#-KVpHDoOOuzgPgC4xhjY:-KW4gpwTC7M22YwPsd8:b-z1ojeq) (raw file):_

Previously, bmcfee (Brian McFee) wrote… > This isn't the correct way to iterate over dataframe indices. > > Instead, do > > ``` python > for idx in sliced_ann.data.index: > ``` > > or avoid iteration altogether: > > ``` python > sliced_ann.data['time'] = sliced_ann.data['time'].apply(lambda x: x - pd.to_timedelta(adjustment, unit='s') > ``` > >

Switched to lambda function implementation


_jams/core.py, line 1058 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > no need for .keys(), and same comments as in `trim` above. >

As noted above with .keys() the call breaks, leaving as is for now.


jams/core.py, line 1561 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > I don't see why this should be required here. >

You need it to perform the next check: if not (0 <= start_time <= end_time <= float(self.file_metadata.duration)):

We could do away with this check, but then it would be possible to call JAMS.trim with a time range that doesn't (or only partially) intersects with the jam. Would that make sense?


_jams/core.py, line 1567 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > fun fact: python supports triple comparison, so you can do `not (0 <= start_time <= self.file_metadata.duration)` >

Nice, so I can actually do: if not (0 <= start_time <= end_time <= float(self.file_metadata.duration)):


_jams/core.py, line 1576 at r2 (raw file):_ I think this does result in a deep copy, no?

>>> jam = jams.JAMS()
>>> jam.file_metadata.duration = 10
>>> jam_trim = jam.trim(5, 8)
>>> print(jam.sandbox)
{}
>>> print(jam_trim.sandbox)
{
  "trim": [
    [
      5, 
      8
    ]
  ]
}

jams/core.py, line 1583 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > -.keys() >

See previous comments, leaving as is for now.


_tests/jams_test.py, line 732 at r2 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > Please break this test into smaller test functions for each test case. >

Broke it.


Comments from Reviewable

bmcfee commented 8 years ago

Reviewed 1 of 3 files at r3. Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion.


jams/core.py, line 939 at r2 (raw file):

Previously, justinsalamon (Justin Salamon) wrote… > I need to know the time range spanned by the annotation to compute the intersection with the trim range specified by the user, for which I need `Annotation.duration`. > > Since an Annotation object can exist independently of a JAMS object, while `duration=None` means "use the jams duration", there's no way for me to access this value from inside this function (or the value may not exist). >

Got it.

I think it's also reasonable, in this case, to just override the time/duration fields with the trim interval. Before trimming, the annotation spans whatever the jams file spans; afterwards, it explicitly spans the specified interval.

The only ambiguous case is when the trim interval overruns file_metadata.duration. I think this is detectable at the level of JAMS.trim, but it's fine to let it slide at the annotation level.

The reason I'm picking on this point is that a lot of annotations do have implicit span, in which case we'd be unable to use trim/slice.


Comments from Reviewable

justinsalamon commented 8 years ago

Review status: 1 of 3 files reviewed at latest revision, 1 unresolved discussion.


jams/core.py, line 939 at r2 (raw file):

Previously, bmcfee (Brian McFee) wrote… > Got it. > > I think it's also reasonable, in this case, to just override the time/duration fields with the trim interval. Before trimming, the annotation spans whatever the jams file spans; afterwards, it explicitly spans the specified interval. > > The only ambiguous case is when the trim interval overruns file_metadata.duration. I think this is detectable at the level of `JAMS.trim`, but it's fine to let it slide at the annotation level. > > The reason I'm picking on this point is that a lot of annotations do have implicit span, in which case we'd be unable to use trim/slice. >

Yeah I see your point. Like you say it can lead to some weird behavior (worse than the trim interval overrunning file_metadata.duration, it might not even intersect with it at all...). But I guess overall it's better to allow this to happen and trust the user will avoid it, at the benefit of being able to trim annotations that don't have an explicit duration set. Implemented.


Comments from Reviewable

bmcfee commented 8 years ago

Reviewed 1 of 2 files at r4. Review status: 2 of 3 files reviewed at latest revision, 2 unresolved discussions.


_tests/jams_test.py, line 983 at r4 (raw file):_

def test_jams_trim():

    @raises(jams.ParameterError, jams.JamsError)

do you really need both errors here? it would be better to have separate test cases for separate error triggers. A catch-all for our package's base exception class is dangerous and could mask other errors.


_tests/jams_test.py, line 997 at r4 (raw file):_

    trim_times = [(-5, -1), (-5, 10), (-5, 20), (5, 20), (18, 20), (10, 8)]
    for tt in trim_times:
        __test_error(jam, *tt)

if you yield __test_error, jam, tt, each of these cases will show up separately. this makes debugging easier.


Comments from Reviewable

justinsalamon commented 8 years ago

Review status: 2 of 3 files reviewed at latest revision, 2 unresolved discussions.


_tests/jams_test.py, line 983 at r4 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > do you really need both errors here? it would be better to have separate test cases for separate error triggers. A catch-all for our package's base exception class is dangerous and could mask other errors. >

OK, I've broken these out into separate tests.


_tests/jams_test.py, line 997 at r4 (raw file):_

Previously, bmcfee (Brian McFee) wrote… > if you `yield __test_error, jam, tt`, each of these cases will show up separately. this makes debugging easier. >

Sure, changed.


Comments from Reviewable

bmcfee commented 8 years ago
:lgtm:

Reviewed 1 of 1 files at r5. Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable