quodlibet / mutagen

Python module for handling audio metadata
https://mutagen.readthedocs.io
GNU General Public License v2.0
1.54k stars 158 forks source link

Batch processing m4a file, system crash #463

Open MayuraVerma opened 4 years ago

MayuraVerma commented 4 years ago

I am batch processing 1.25 TB, 92500 files, each file around 10-12MB, in python on mac. All the songs are stored in Samsung EVO SSD It loops through the audio songs and apply tags. Macbook Pro restarts with white screen after looping through few hundred files.

I am suspecting the mutagen is writing the "entire file to drive" and drive running out of cache and crashing the system.

If this is true, shouldn't the mutagen open just the metadata and save the only delta, instead of write the entire file.

lazka commented 4 years ago

If this is true, shouldn't the mutagen open just the metadata and save the only delta, instead of write the entire file.

We do, if there is enough space in the file. The second change definitely does because we add extra space for future changes if we have to rewrite the file.

MayuraVerma commented 4 years ago

We do, if there is enough space in the file. The second change definitely does because we add extra space for future changes if we have to rewrite the file.

The music files already has metadata. I am not adding any new fields, but making changes in the value.

Should I clear the existing meta first, and then write the data?

lazka commented 4 years ago

Changes to the value can also lead to metadata needing more space (one more character is enough), and even if nothing changes we might need slightly more space because we might write things back slightly differently then other programs.

So only if the second update needs a rewrite there is a bug, everything else is expected.

lazka commented 4 years ago

Not sure how I can help here sadly. Ideas welcome.

MayuraVerma commented 4 years ago

Could we add three function

  1. to check available bytes in audio file for metadata or audio and metadata size.
  2. to check the size of metadata that user is trying to add before save function.
  3. Allow function to allocate user defined size for metadata for future proof.
MayuraVerma commented 4 years ago

With these function, user can query all files in a loop and get the available space. Calculate the new metadata size. Compare two numbers, estimate how many files will be re-written. Drop fields that are not important, re-calculate the space required.

lazka commented 4 years ago

Have a look at https://mutagen.readthedocs.io/en/latest/user/padding.html

It doesn't provide everything you ask for though. For any new features we'd need to have a good use case first.

MayuraVerma commented 4 years ago

This is what I did.

def max_padding(info): return 1000000

I padded the m4a with 1000000 bytes. Saved the files : audio.save(padding=max_padding) Then, added couple of new fields audio["----:TXXX:TEST7"] = bytes("TEST7", 'UTF-8') audio["----:TXXX:TEST8"] = bytes("TEST8", 'UTF-8') audio.save() and saved the files with no padding option.

File size was reduced, because the function calculated the size of new fields and apply default padding. Files are rewritten to the drive with smaller size (perhaps applied default padding). It didn't just add the new fields and retained the remaining empty space. How to make use of the custom large padding? recalculate the size of the metadata and padding, apply the remaining padding as custom padding?

We need:

  1. how to calculate the "the size of metadata that user is trying to add before save function.". I mean the size of new set data.

  2. To know the absolute size of audio and metadata envelope Currently, we can assign delta padding to metadata.

  3. Flag in the save function to retain the file size, not use default padding.

lazka commented 4 years ago

How to make use of the custom large padding?

def mypadding(info):
    if info.padding < 0:  # in case there is no padding left after write add lots
        return 1000000
    else: # if there is some left, just use it
        return info.padding
jmeosbn commented 4 years ago

I ended up doing something similar but it would be nice to have a setting such as "don't increase padding if no need" and "don't decrease ever".

This could be presented as a "finalise" option in the options dict for save() - to avoid accidental use - for when you don't intend to do anything more in future with the file (other than fixing typos etc.).

My custom padding starts off with:

    default = info.get_default_padding()

    # only accept increase in padding
    calculated = max(info.padding, default)

    # difference between desired and current
    delta = calculated - info.padding

You can keep track of expected album art sizes and mutagen will give you some sizes of tags etc.

It is DIY atm though so there could be improvements here - I'd offer a PR but I can barely remember the details of it atm!!

jmeosbn commented 4 years ago

Forgot to mention that this issue rang a bell for me and I remembered now, it was when scripting the adding of album art to large files and normally requiring re-write.

I added a delay as it was only a very short temp script.

MayuraVerma commented 4 years ago

I added a delay as it was only a very short temp script.

I am doing the same. But unfortunately, I have around 900 GB of data. SSD has only 30GB SLC, rest is MLC. It would take for ever to modify.

I might have to modify the data at least 3 times in next few months. It would be a very length process.

jmeosbn commented 4 years ago

I'm removing artwork first and then writing my text data with that free space. No need to rewrite most of the time if this fits your situation.

MayuraVerma commented 4 years ago
def mypadding(info):
    if info.padding < 0:  # in case there is no padding left after write add lots
        return 1000000
    else: # if there is some left, just use it
        return info.padding

This would work. Is it possible to calculate size of the metadata? This way, I can reasonable size for padding. 1MB padding on 95000 songs, is 95GB overhead.

I can however remove the tag, save the file. And apply the tag with default padding, save the file and compare the difference. It would be nice to query the size of metadata in a program.

jmeosbn commented 4 years ago

MP3 has tags.size() but you need to calculate for mp4, is that some parity that could be added @lazka ?

Atm, I'm getting the atoms and seeing what size they are... again, I'd love to be able to do this in an "approved" manner as I don't think atoms for an opened file are made accessed?

jmeosbn commented 4 years ago

I've opened the file using Atoms() then looped through to add up tag atoms, then the one mdat atom subtracted from the file size will include padding and stuff too.

Note: incomplete but just the basics:

atoms = Atoms(fileobj)

# size of first track data
atoms.path(b"mdat")[0].length
jmeosbn commented 4 years ago

@lazka Quick question: one thing that puts me off doing the above, is that while I think it should just efficiently use the memory already read (for the still open File), when re-opening, I don't know that it does. Likewise for re-opening as raw - it does seem efficient but I have not tested or looked into it.

So what must one be aware of when having a second read only "view" of the atoms, or re-opening a just closed file* in order to read some tags raw.

*I'm assuming that closing one writable instance before opening another would be required?

MayuraVerma commented 4 years ago

I'm removing artwork first and then writing my text data with that free space. No need to rewrite most of the time if this fits your situation.

nope, actually I need the artwork. If I can calculate the space in a loop, I can prioritize some fields.

lazka commented 4 years ago

So we are all on the same page, this is what the default does: https://github.com/quodlibet/mutagen/blob/eee54961a15620a83ca28e9762d21ccbad781d47/mutagen/_tags.py#L44 It tries to add more padding if rewriting the file would take longer.

Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?

I don't know that it does. Likewise for re-opening as raw - it does seem efficient but I have not tested or looked into it

The operating system will keep most things from disk in RAM, so this should be fairly fast, but ymmv.

*I'm assuming that closing one writable instance before opening another would be required?

Yes

MayuraVerma commented 4 years ago

Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?

This would help to understand the size of the tags in the file, and free size. By calculating the size of new set of data, I continue to drop some fields until, tags are written without resizing the file.

We need a way to know the size of the tag, padding outside save function and calculate size of new set of tags.

jmeosbn commented 4 years ago

Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?

This would be useful for calculating average bitrate of audio data vs. audio+tags; pre-guessing if padding is likely to be needed etc.

That's off the top of my head but It's useful enough that I made my own version. 😉

Another one I've hacked in for now and would be useful here is the _padding property on on the MP4 file instance.

On the MP3 type (and others?) it's not passed through to the main File from _EasyID3__id3 as it is for mp4.

It'd be really handy to have this info available on all file types.

The operating system will keep most things from disk in RAM, so this should be fairly fast, but ymmv.

Ah, seems like caution is warranted then. The reason I ask is that I got some ways into a project using the Easy* tags and then finding if I want to do anything with covers and sizes etc. I need to open with "raw" tags (easy=false).

I've thought of rewriting and just using raw, but honestly, I've put some time in to extending the easy tag definitions (something I hope to put in a PR if you're interested?). Are there any plans to make either the artwork or the raw tags accessible from easy tags?

They already are passed via the Register... functions, but not to inspect via the instance interface. That could obviate the need to access raw, if the size and padding size etc. is accessible.

lazka commented 4 years ago

Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?

This would help to understand the size of the tags in the file, and free size. By calculating the size of new set of data, I continue to drop some fields until, tags are written without resizing the file.

We need a way to know the size of the tag, padding outside save function and calculate size of new set of tags.

That seems like a very special use case and not something I'd like to support. But I still don't know why you need this, so not sure.

lazka commented 4 years ago

This would be useful for calculating average bitrate of audio data vs. audio+tags

You can already get the overhead by doing filesize - (bitrate * audio length)

Are there any plans to make either the artwork or the raw tags accessible from easy tags?

Not that I know of.

jmeosbn commented 4 years ago

You can already get the overhead by doing filesize - (bitrate * audio length)

Ah, that's good to know - when I first started using mutagen that number seemed to be calculated differently, now it's pretty close to what I calculate.

# info.bitrate
62794

# my calculation
62794.28581226417

While this is good for the bitrate, since it is rounded, it will be out by several kb on a medium sized file (~300MB).

jmeosbn commented 4 years ago

Here are the sizes for the above:

218057642  # total

# audio data - inferred size from bitrate
215625061.56408164

# audio data - using size of 'mdat'
215626043

The padding is ~2048~ 59541 (likely due to previously removed artwork).

Subtracting the audio data size only gives the size for "everything else" - tags, artwork, padding plus any other tracks inc. qt chapters, subtitles, other audio/video data etc.

MayuraVerma commented 4 years ago

You can already get the overhead by doing filesize - (bitrate * audio length)

How accurate is this?

Is it possible to please provide to two data. audio size? tag size? outside save function.

jmeosbn commented 4 years ago

I think it used to be the "nominal" bitrate from the header - but I've tried a few mp3 and m4a files and all report the same as my calculated value (minus the fractional part).

For getting the bitrate then, this does indeed obviate the need to calculate it yourself.

But I do not think it is a substitute for "knowing" the actual audio data size (that the bitrate is calculated from, but without the rounding error), and the tag sizes.

While there's an argument that padding is rarely needed - and so it's single leading underscore makes sense - it is not able to be retrieved in a straightforward manner unless it's added to the classes.

jmeosbn commented 4 years ago

Just a further note for @MayuraVerma

bitrate seems to be nominal for some types or even 0. :(

I don't don't of this means the bitrate for MP3/4 will sometimes be nominal, but looking at the source the bitrate is sometimes calculated by the size/length (actual bitrate which is good) but length is often noted as an estimate. :/

I think it's reasonable to have access to the size of the data you're working with.

MayuraVerma commented 4 years ago

Just a further note for @MayuraVerma

bitrate seems to be nominal for some types or even 0. :(

I don't don't of this means the bitrate for MP3/4 will sometimes be nominal, but looking at the source the bitrate is sometimes calculated by the size/length (actual bitrate which is good) but length is often noted as an estimate. :/

I think it's reasonable to have access to the size of the data you're working with.

Thank you.

jmeosbn commented 4 years ago

@lazka I understand this is of a low priority, but if I figure out how to add these sizes (the internals are new to me) will you accept a PR?

MayuraVerma commented 4 years ago

@jmeosbn can you please share the details? @lazka Please advise

jmeosbn commented 4 years ago

@jmeosbn can you please share the details? @lazka Please advise

Hi there, if you're referring to padding - I did look into it but haven't got far yet.

Unfortunately for what is probably a straightforward addition for someone familiar with the codebase, it's a much more involved project to learn that codebase for me.

Add to that no certainty that it'd be accepted (else surely it'd just be added by now) or that it's perhaps a more involved addition than I'd expect. (Again, lack of guidance, along with consistency in api such as not providing bitrate, sample rate etc. for some formats has put me off somewhat)

Also, using the example padding hook you can pull the value from there (it just needs to be called at some point by the internal code).