mid3v2: custom delimiters in COMM, TXXX, POPM descriptions

GoogleCodeExporter commented 9 years ago

I use mid3v2 as a tool for skripted tagging of my music collection for Quod 
Libet, and I'm quite happy with it. There's just one limitation that I find a 
little irksome. A command like the following just won't work as intended:

mid3v2 --TXXX "QuodLibet::albumartist:The Examples" track.mp3

The reason, of course, are the double quotes in the prefix as used by Quod 
Libet (and Ex Falso), since they conflict with the delimiter colon in the 
description key for TXXX (as well as COMM and POMP) frames.

My first thought was to add some special handling for QuodLibet:: prefixes 
and/or double colons. A better idea, in my view, is to allow users to specify 
their own delimiters from the command line as needed, like this:

mid3v2 --delimiter=# --TXXX "QuodLibet::albumartist#The Examples" track.mp3

The colon would remain the default delimiter. There's no risk of breaking 
existing skripts with the added option. Backward compatibility with id3v2 would 
be preserved.

My tentative patch shows how mid3v2 (and its documentation) might be adapted 
for a new --delimiter option. I know very little python, so my code is 
certainly far from perfect. Most of my changes are built after the --verbose 
option.

Original issue reported on code.google.com by thuerrsc...@gmail.com on 30 Aug 2013 at 2:06

Attachments:

mid3v2-deliminator.diff

GoogleCodeExporter commented 9 years ago

Thanks for looking into this.

I'd prefer if it was possible to escape the delimiter instead of redefining it. 
To keep it compatible we could add an "-e" option like "/bin/echo" which 
enables this new behavior.

mid3v2 -e --TXXX "QuodLibet\:\:albumartist:The Examples" track.mp3

What do you think?

This also seems to be an unsolved problem in the original id3v2: 
http://www.jetmore.org/john/blog/2012/07/patch-allow-id3v21-to-include-colons-in
-comment-fields/

Original comment by reiter.christoph@gmail.com on 30 Aug 2013 at 11:44

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

An escaping mechanism would surely be just as good as supporting custom 
delimiters. In fact, when I first ran into the double colon problem, prefixing 
them with backslashes was one of the first things I tried. The reasons for my 
disappointed became clear when I looked into mid3v2's source ...

As for implementing escaping in python, I have no idea how to do that. Maybe 
there's a library function, since it's such a common task? Python's 
string.split() method doesn't seem to support regular expressions, which would 
offer one way of do this.

Original comment by thuerrsc...@gmail.com on 30 Aug 2013 at 5:27

GoogleCodeExporter commented 9 years ago

There is no such thing in the stdlib afaik.

Something like this should work: http://bpaste.net/show/hVJFGP0WtDJDbZp8WwbR/

Original comment by reiter.christoph@gmail.com on 31 Aug 2013 at 1:25

GoogleCodeExporter commented 9 years ago

I don't think we need a full-blown parser for our simple unescaping. A couple 
of regular expressions will do, I think. See the attached diff for a draft 
implementation.

I've added a split_value function that splits key defititions for COMM, TXXX, 
and POPM frames on non-escaped colons only, replaces double backslashes with 
single ones (so that \ or even \: can be written into tags), and removes any 
other backslashes. As I said, I have practically no Python experience (I'm just 
starting to like it), so please be lenient on my code.

So far there are no provisions for common escape sequences like \n for a 
newline or \t for a tab. We'd need a mapping table for those and/or some more 
regex trickery. I'm not sure if newlines are even allowed in ID3 tags. Maybe a 
"big" escaping mechanism would carry things too far. Just solving the colon 
problem is good enough for me.

If you decide to go for a escaping mechanism like the one proposed here, a note 
should be added to the release notes and/or man page that (a) escaped colons 
are supported now and (b) less obviously and possibly breaking existing stuff, 
that backslashes need to be doubled for the frames concerned.

Original comment by thuerrsc...@gmail.com on 2 Sep 2013 at 2:38

Attachments:

mutagen-escape-colon.diff

GoogleCodeExporter commented 9 years ago


> I don't think we need a full-blown parser for our simple unescaping. A
> couple of regular expressions will do, I think. See the attached diff
> for a draft implementation.

Why not? The regexp doesn't handle all cases and is imho harder to
understand.

> So far there are no provisions for common escape sequences like \n for
> a newline or \t for a tab. We'd need a mapping table for those and/or
> some more regex trickery. I'm not sure if newlines are even allowed in
> ID3 tags. Maybe a "big" escaping mechanism would carry things too far.
> Just solving the colon problem is good enough for me.

Python's string_escape codec should make this easy:

"foobar\\n".decode("string_escape") == "foobar\n"

Sounds like a good idea.

> If you decide to go for a escaping mechanism like the one proposed
> here, a note should be added to the release notes and/or man page that
> (a) escaped colons are supported now and (b) less obviously and
> possibly breaking existing stuff, that backslashes need to be doubled
> for the frames concerned. 

I would prefer the "-e" switch to turn it on and default to the old
behavior.

Original comment by reiter.christoph@gmail.com on 3 Sep 2013 at 5:22

GoogleCodeExporter commented 9 years ago

> Why not? The regexp doesn't handle all cases and is imho harder to
> understand.

I guess regexp vs. parser is a matter of personal preference. Coming from a 
perl background, I'm naturally biassed towards regular expressions, maybe too 
much so.

Anyway, my regexp was intended to handle just one case, i.e. colon delimiters 
escaped with a backslash. What we're talking about now is a more general 
escaping mechanism, optionally activated with an -e switch, that deals with a 
much larger set of escape sequences, for all kinds of ID3 tags.

I really like your ideas, although I know far too little about python's string 
handling to even try to implement them. From what I gather from 
http://docs.python.org/2/reference/lexical_analysis.html#string-literals, the 
string_escape encoding handles, among other things, \xhh sequences. So writing 
colons as \x3A would be an easy way of including them in TXXX etc. frames, 
without the need for any special escaping. Which would solve the initial 
problem quite elegantly.

Original comment by thuerrsc...@gmail.com on 3 Sep 2013 at 7:53

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 42885877ff28.

Original comment by reiter.christoph@gmail.com on 4 Sep 2013 at 10:51

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks for the quick fix! I haven't really tested the finished --escape option 
yet, but from I've seen, it looks like a clean and simple solution that just 
works. Well done!

Original comment by thuerrsc...@gmail.com on 4 Sep 2013 at 10:13

GoogleCodeExporter commented 9 years ago

Looks I've run into a problem:

 mid3v2 --TALB "aäā" -e test.mp3 
Traceback (most recent call last):
  File "/home/thuerrsch/bin/mid3v2", line 344, in <module>
    main(sys.argv)
  File "/home/thuerrsch/bin/mid3v2", line 328, in main
    write_files(parser.edits, args, options.escape)
  File "/home/thuerrsch/bin/mid3v2", line 148, in write_files
    edits = dict((k, map(unescape_string, v)) for (k, v) in edits.items())
  File "/home/thuerrsch/bin/mid3v2", line 148, in <genexpr>
    edits = dict((k, map(unescape_string, v)) for (k, v) in edits.items())
  File "/home/thuerrsch/bin/mid3v2", line 80, in unescape_string
    return string.decode("unicode_escape")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: 
ordinal not in range(128)

It seems that with the -e switch, all non-ASCII characters *must* be escaped. 
Is this intentional? I'd prefer the escaping to be optional, not mandatory for 
each character when -e is used. Would that be possible?

Original comment by thuerrsc...@gmail.com on 5 Sep 2013 at 4:00

GoogleCodeExporter commented 9 years ago

Thanks for catching this.

Original comment by reiter.christoph@gmail.com on 5 Sep 2013 at 8:28

GoogleCodeExporter commented 9 years ago

Should be fixed: revision 3a4f8997ce1d

thanks again

Original comment by reiter.christoph@gmail.com on 5 Sep 2013 at 9:03

GoogleCodeExporter commented 9 years ago

Sorry to keep bothering you. I found to more cases that lead to unexpected 
errors:

mid3v2 --TALB 'a vs. ä, \x61 vs. \xe4, \141 vs. \344' -e test.mp3
mid3v2 --TALB 'trailing backslash\' --TIT2 "trailing backslash\\" -e test.mp3

Thanks for looking into this.

Original comment by thuerrsc...@gmail.com on 5 Sep 2013 at 4:44

GoogleCodeExporter commented 9 years ago

Thanks! revision 7a3f88a0be2b

$ mid3v2 --TALB 'a vs. ä, \x61 vs. \xe4, \141 vs. \344' -e test.mp3
TALB: 'utf8' codec can't decode byte 0xe4 in position 16: invalid continuation 
byte

$ mid3v2 --TALB 'trailing backslash\' --TIT2 "trailing backslash\\" -e test.mp3
TALB: Trailing \ in string

Original comment by reiter.christoph@gmail.com on 5 Sep 2013 at 6:28

GoogleCodeExporter commented 9 years ago

Thanks again for the quick fix! Everything's fine now, as far as I can tell.

As an aside, the ä character used in my examples above would have to be 
escaped as \xc3\xa4 rather than \xe4 (UTF-8 byte sequence vs. character code). 
The former is interpreted correctly by mid3v2, the latter produces an error 
message as it should.

Original comment by thuerrsc...@gmail.com on 6 Sep 2013 at 3:30

leosongwei / mutagen

mid3v2: custom delimiters in COMM, TXXX, POPM descriptions #159