Open MayuraVerma opened 4 years ago
If this is true, shouldn't mutagen open just the metadata and save only the delta, instead of writing the entire file?
We do, if there is enough space in the file. The second change definitely does because we add extra space for future changes if we have to rewrite the file.
The music files already have metadata. I am not adding any new fields, just changing the values.
Should I clear the existing metadata first and then write the data?
Changes to a value can also lead to the metadata needing more space (one more character is enough), and even if nothing changes we might need slightly more space because we might write things back slightly differently than other programs do.
So only if the second update needs a rewrite there is a bug, everything else is expected.
Not sure how I can help here sadly. Ideas welcome.
Could we add three functions?
With these functions, a user could query all files in a loop and get the available space, calculate the new metadata size, compare the two numbers, and estimate how many files will be rewritten; then drop fields that are not important and recalculate the space required.
Have a look at https://mutagen.readthedocs.io/en/latest/user/padding.html
It doesn't provide everything you ask for though. For any new features we'd need to have a good use case first.
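Building on the padding docs linked above, a padding callback can be passed to save(). A minimal sketch (the file path and the 8192-byte growth value are illustrative assumptions, not mutagen defaults):

```python
def keep_padding(info):
    # mutagen calls this with a PaddingInfo object; info.padding is the
    # amount of padding (in bytes) that would remain after the pending
    # write, and it is negative when the new tags no longer fit.
    if info.padding < 0:
        # not enough room: grow the padding so future edits fit in place
        return 8192  # illustrative value, tune for your tag sizes
    # enough room: reuse the existing padding, keeping the file size stable
    return info.padding

# usage sketch (hypothetical path):
# from mutagen.mp4 import MP4
# audio = MP4("song.m4a")
# audio.save(padding=keep_padding)
```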
This is what I did.
def max_padding(info): return 1000000
I padded the m4a with 1000000 bytes and saved the file: audio.save(padding=max_padding). Then I added a couple of new fields:
audio["----:TXXX:TEST7"] = bytes("TEST7", 'UTF-8')
audio["----:TXXX:TEST8"] = bytes("TEST8", 'UTF-8')
and saved the file with audio.save(), i.e. with no padding option.
The file size was reduced: the save calculated the size of the new fields and applied the default padding. The file was rewritten to the drive at a smaller size (presumably with the default padding applied); it didn't just add the new fields and retain the remaining empty space. How do I make use of the custom large padding? Recalculate the size of the metadata and padding, and apply the remainder as custom padding?
We need:
A way to calculate the size of the metadata the user is trying to add, before the save function. I mean the size of the new set of data.
A way to know the absolute size of the audio and of the metadata envelope. Currently, we can only assign a padding delta to the metadata.
A flag in the save function to retain the file size, rather than use the default padding.
How to make use of the custom large padding?
def mypadding(info):
if info.padding < 0: # in case there is no padding left after write add lots
return 1000000
else: # if there is some left, just use it
return info.padding
I ended up doing something similar but it would be nice to have a setting such as "don't increase padding if no need" and "don't decrease ever".
This could be presented as a "finalise" option in the options dict for save() - to avoid accidental use - for when you don't intend to do anything more in future with the file (other than fixing typos etc.).
My custom padding starts off with:
default = info.get_default_padding()
# only accept increase in padding
calculated = max(info.padding, default)
# difference between desired and current
delta = calculated - info.padding
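For completeness, the fragment above can be rounded out into a full callback. This sketch assumes info is mutagen's PaddingInfo, whose get_default_padding() applies the built-in heuristic:

```python
def never_shrink(info):
    # take whichever is larger: the padding already present, or what
    # mutagen's default strategy would choose; padding therefore never
    # decreases across saves
    default = info.get_default_padding()
    return max(info.padding, default)
```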
You can keep track of expected album art sizes and mutagen will give you some sizes of tags etc.
It is DIY atm though so there could be improvements here - I'd offer a PR but I can barely remember the details of it atm!!
Forgot to mention that this issue rang a bell for me and I remembered now, it was when scripting the adding of album art to large files and normally requiring re-write.
I added a delay as it was only a very short temp script.
I am doing the same. But unfortunately, I have around 900 GB of data. The SSD has only 30 GB of SLC cache; the rest is MLC. It would take forever to modify.
I might have to modify the data at least 3 times in the next few months. It would be a very lengthy process.
I'm removing artwork first and then writing my text data with that free space. No need to rewrite most of the time if this fits your situation.
def mypadding(info):
    if info.padding < 0: # in case there is no padding left after write add lots
        return 1000000
    else: # if there is some left, just use it
        return info.padding
This would work. Is it possible to calculate the size of the metadata? That way, I can pick a reasonable size for the padding; 1 MB of padding on 95000 songs is 95 GB of overhead.
I can, however, remove the tag and save the file, then apply the tag with the default padding, save again, and compare the difference. It would be nice to be able to query the size of the metadata from a program.
MP3 has tags.size(), but for MP4 you need to calculate it yourself; is that some parity that could be added @lazka?
At the moment I'm getting the atoms and seeing what size they are... again, I'd love to be able to do this in an "approved" manner, as I don't think the atoms of an opened file are made accessible?
I've opened the file using Atoms(), then looped through to add up the tag atoms; note that the one mdat atom subtracted from the file size will include padding and such too.
Note: incomplete but just the basics:
atoms = Atoms(fileobj)
# size of first track data
atoms.path(b"mdat")[0].length
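The mdat-based overhead calculation can be sketched as a small helper; the commented mutagen calls mirror the Atoms() usage above, and the path is hypothetical:

```python
def container_overhead(file_size, mdat_sizes):
    # everything that is not raw sample data: tags, artwork, 'free'
    # (padding) atoms, the moov metadata, chapters, etc.
    return file_size - sum(mdat_sizes)

# gathering the inputs with mutagen (hypothetical path):
# import os
# from mutagen.mp4 import Atoms
# with open("song.m4a", "rb") as f:
#     mdat = Atoms(f).path(b"mdat")[0].length
# overhead = container_overhead(os.path.getsize("song.m4a"), [mdat])
```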
@lazka Quick question: one thing that puts me off doing the above is that, while I think it should just efficiently reuse the memory already read (for the still-open File), when re-opening I don't know that it does. Likewise for re-opening as raw: it does seem efficient, but I have not tested or looked into it.
So what must one be aware of when having a second read only "view" of the atoms, or re-opening a just closed file* in order to read some tags raw.
*I'm assuming that closing one writable instance before opening another would be required?
I'm removing artwork first and then writing my text data with that free space. No need to rewrite most of the time if this fits your situation.
nope, actually I need the artwork. If I can calculate the space in a loop, I can prioritize some fields.
So we are all on the same page, this is what the default does: https://github.com/quodlibet/mutagen/blob/eee54961a15620a83ca28e9762d21ccbad781d47/mutagen/_tags.py#L44 It tries to add more padding if rewriting the file would take longer.
Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?
I don't know that it does. Likewise for re-opening as raw - it does seem efficient but I have not tested or looked into it
The operating system will keep most things from disk in RAM, so this should be fairly fast, but ymmv.
*I'm assuming that closing one writable instance before opening another would be required?
Yes
Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?
This would help to understand the size of the tags in the file and the free space. By calculating the size of the new set of data, I can keep dropping fields until the tags can be written without resizing the file.
We need a way to know the size of the tags and the padding outside the save function, and a way to calculate the size of a new set of tags.
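Once tag and free-space sizes are available, the drop-fields loop described here could look like the sketch below. The per-field byte sizes are estimates the caller supplies (e.g. value length plus per-format framing), since mutagen doesn't expose them directly:

```python
def fields_that_fit(field_sizes, free_space, priority):
    # field_sizes: mapping of field name -> estimated encoded size in bytes
    # priority: field names ordered from most- to least-important
    # keeps the highest-priority fields whose total stays within free_space
    kept, used = [], 0
    for name in priority:
        size = field_sizes.get(name, 0)
        if used + size <= free_space:
            kept.append(name)
            used += size
    return kept
```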
Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?
This would be useful for calculating average bitrate of audio data vs. audio+tags; pre-guessing if padding is likely to be needed etc.
That's off the top of my head but It's useful enough that I made my own version. 😉
Another one I've hacked in for now, and which would be useful here, is the _padding property on the MP4 file instance.
On the MP3 type (and others?) it's not passed through to the main File from _EasyID3__id3 as it is for MP4.
It'd be really handy to have this info available on all file types.
The operating system will keep most things from disk in RAM, so this should be fairly fast, but ymmv.
Ah, seems like caution is warranted then. The reason I ask is that I got some way into a project using the Easy* tags, and then found that if I want to do anything with covers and sizes etc. I need to open with raw tags (easy=False).
I've thought of rewriting and just using raw, but honestly I've put some time into extending the easy tag definitions (something I hope to put in a PR, if you're interested?). Are there any plans to make either the artwork or the raw tags accessible from easy tags?
They are already passed via the Register... functions, but cannot be inspected via the instance interface. That could obviate the need to access raw, if the size and padding size etc. were accessible.
Exposing the metadata size would be possible I guess, but I'm wondering how you would use it?
This would help to understand the size of the tags in the file and the free space. By calculating the size of the new set of data, I can keep dropping fields until the tags can be written without resizing the file.
We need a way to know the size of the tags and the padding outside the save function, and a way to calculate the size of a new set of tags.
That seems like a very special use case and not something I'd like to support. But I still don't know why you need this, so not sure.
This would be useful for calculating average bitrate of audio data vs. audio+tags
You can already get the overhead by doing filesize - (bitrate * audio length)
Are there any plans to make either the artwork or the raw tags accessible from easy tags?
Not that I know of.
You can already get the overhead by doing
filesize - (bitrate * audio length)
Ah, that's good to know - when I first started using mutagen that number seemed to be calculated differently, now it's pretty close to what I calculate.
# info.bitrate
62794
# my calculation
62794.28581226417
While this is good for the bitrate, since it is rounded, it will be out by several kb on a medium sized file (~300MB).
Here are the sizes for the above:
218057642 # total
# audio data - inferred size from bitrate
215625061.56408164
# audio data - using size of 'mdat'
215626043
The padding is ~~2048~~ 59541 (likely due to previously removed artwork).
Subtracting the audio data size only gives the size for "everything else" - tags, artwork, padding plus any other tracks inc. qt chapters, subtitles, other audio/video data etc.
You can already get the overhead by doing
filesize - (bitrate * audio length)
How accurate is this?
Is it possible to please provide these two numbers, the audio size and the tag size, outside the save function?
I think it used to be the "nominal" bitrate from the header - but I've tried a few mp3 and m4a files and all report the same as my calculated value (minus the fractional part).
For getting the bitrate then, this does indeed obviate the need to calculate it yourself.
But I do not think it is a substitute for "knowing" the actual audio data size (that the bitrate is calculated from, but without the rounding error), and the tag sizes.
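The filesize - bitrate * length estimate discussed above, as a sketch. Note that mutagen's info.bitrate is in bits per second, so it has to be divided by 8, and the rounding of the reported bitrate means the result can be off by several KB on a large file:

```python
def estimated_overhead(file_size, bitrate, length):
    # file_size in bytes (e.g. os.path.getsize), bitrate in bits/second
    # (as in mutagen's info.bitrate), length in seconds (info.length);
    # result is the approximate non-audio overhead in bytes
    return file_size - (bitrate / 8.0) * length
```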
While there's an argument that the padding info is rarely needed, and so its single leading underscore makes sense, it cannot be retrieved in a straightforward manner unless it's added to the classes.
Just a further note for @MayuraVerma
bitrate seems to be nominal for some types or even 0. :(
I don't know if this means the bitrate for MP3/MP4 will sometimes be nominal, but looking at the source, the bitrate is sometimes calculated from size/length (the actual bitrate, which is good), while the length is often noted as an estimate. :/
I think it's reasonable to have access to the size of the data you're working with.
Thank you.
@lazka I understand this is of a low priority, but if I figure out how to add these sizes (the internals are new to me) will you accept a PR?
@jmeosbn can you please share the details? @lazka Please advise
Hi there, if you're referring to padding - I did look into it but haven't got far yet.
Unfortunately for what is probably a straightforward addition for someone familiar with the codebase, it's a much more involved project to learn that codebase for me.
Add to that no certainty that it'd be accepted (else surely it'd have been added by now), or that it's perhaps a more involved addition than I'd expect. (Again, the lack of guidance, along with inconsistency in the API, such as not providing bitrate, sample rate etc. for some formats, has put me off somewhat.)
Also, using the example padding hook you can pull the value from there (it just needs to be called at some point by the internal code).
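One way to "pull the value from there", as suggested: a probe callback that records what mutagen passes in without changing the padding decision. This sketch assumes PaddingInfo's documented attributes size (content size in bytes) and padding:

```python
def make_probe(store):
    # returns a padding callback that logs each (size, padding) pair it
    # sees into `store`, then keeps the existing padding (clamped to >= 0,
    # since a callback must not return a negative amount)
    def hook(info):
        store.append((info.size, info.padding))
        return max(info.padding, 0)
    return hook

# usage sketch:
# seen = []
# audio.save(padding=make_probe(seen))
# seen now holds the (content size, padding) values mutagen reported
```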
I am batch processing 1.25 TB, 92500 files, each around 10-12 MB, in Python on a Mac. All the songs are stored on a Samsung EVO SSD. The script loops through the audio files and applies tags. The MacBook Pro restarts with a white screen after looping through a few hundred files.
I suspect that mutagen is writing the entire file to the drive, and that the drive is running out of cache and crashing the system.
If this is true, shouldn't mutagen open just the metadata and save only the delta, instead of writing the entire file?