leosongwei / mutagen

Automatically exported from code.google.com/p/mutagen
GNU General Public License v2.0
0 stars 0 forks source link

Parsing non-mp4 file as mp4 allocates way too many objects #109

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
In the problematic case I've found, the file was mostly comprised of '\x00', 
which triggered a problematic case with Atom parsing.

* The atom 'length' is parsed as zero and the atom name as '\x00\x00\x00\x00'
* A new Atom object is created for every 8 bytes of the source file until it 
hits a non-zero length, which with some luck might seek to EOF.

In the memory dump I have, it ended up creating ~130Mb worth of Atom objects in 
memory, for what seems to be a ~5Mb file with mostly '\x00' in it.

Original issue reported on code.google.com by sidnei.d...@gmail.com on 9 Apr 2012 at 2:06

GoogleCodeExporter commented 9 years ago
- Created file with:
  dd if=/dev/zero of=output.m4a  bs=1M  count=10

- Loaded Atoms with:
  from mutagen.mp4 import Atoms
  atoms = Atoms(open("output.m4a", "rb"))

- Dumped memory with meliae.scanner.dump_all_objects
Total 3941125 objects, 50 types, Total size = 523.1MiB (548541255 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0 1310720  33 450887680  82  82     344 Atom
     1 1314725  33  54024927   9  92    4871 str
     2 1310910  33  31461840   5  97      24 int
     3     134   0  11024656   2  9911007840 list
     4     262   0    391312   0  99   49432 dict

- Added '__slots__' to Atom:
Total 3941132 objects, 50 types, Total size = 193.1MiB (202511553 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0 1310720  33 104857600  51  51      80 Atom
     1 1314730  33  54025145  26  78    4871 str
     2 1310910  33  31461840  15  93      24 int
     3     134   0  11024656   5  9911007840 list
     4     262   0    391312   0  99   49432 dict

- Changed to ignore atoms with atom.length = 0 (not sure this is valid anyway?):
Total 9005 objects, 49 types, Total size = 1.4MiB (1450768 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0     262   2    391312  26  26   49432 dict

Original comment by sidnei.d...@gmail.com on 9 Apr 2012 at 9:49

GoogleCodeExporter commented 9 years ago
Two suggested patches.

Original comment by sidnei.d...@gmail.com on 9 Apr 2012 at 9:58

Attachments:

GoogleCodeExporter commented 9 years ago
Turns out a zero-length atom is supposed to mean 'I'm the last atom, and my 
length extends to EOF'. Here's an updated patch.

Original comment by sidnei.d...@gmail.com on 9 Apr 2012 at 11:00

Attachments:

GoogleCodeExporter commented 9 years ago
FYI, previous comment taken from:

http://wiki.multimedia.cx/index.php?title=QuickTime_container#Byte_Ordering

"""
An atom consists of a size, a type, and a data payload. An atom is laid out as 
follows:

bytes 0-3    atom size (including 8-byte size and type preamble)
bytes 4-7    atom type
bytes 8..n   data

The 4 bytes allotted for the atom size field limit the maximum size of an atom 
to 4 GB. Quicktime also has a provision to allow atoms with 64-bit atom size 
fields by setting the size field 1 and adding the 8-byte size field after the 
atom type:

bytes 0-3    always 0x00000001
bytes 4-7    atom type
bytes 8-15   atom size (including 16-byte size and type preamble)
bytes 16..n  data

This is a logical exception since an atom always needs to be at least 8 bytes 
in length to account for the preamble. Therefore, if the size field is 1, load 
the 64-bit atom size from just after the atom type field.

If, on the other hand, the size field is 0, then the atom extends to the end of 
the file.
"""

Original comment by sidnei.d...@gmail.com on 9 Apr 2012 at 11:13

GoogleCodeExporter commented 9 years ago
 - you forgot to seek back
 - zero size is only allowed for top level atoms, so you can raise an exception if that's not the case (unfortunately that info is not available, so maybe extend Atom to take a parent=None ?)

Original comment by reiter.christoph@gmail.com on 10 Apr 2012 at 7:20

GoogleCodeExporter commented 9 years ago
Uhm, why seek back? The last line in Atom.__init__ will seek to self.offset + 
self.length, which will put it back to EOF again. I guess in the case it's a 
top-level container atom?

I'm wary to add a parent attribute, since that might end up causing a reference 
cycle (not sure that's what you mean anyway). Maybe having level=0 instead of 
parent=None, and passing level=self.level+1 to children?

Original comment by sidnei.d...@gmail.com on 10 Apr 2012 at 11:09

GoogleCodeExporter commented 9 years ago
The atom name could still be in _CONTAINERS with a zero length.

Original comment by reiter.christoph@gmail.com on 10 Apr 2012 at 11:18

GoogleCodeExporter commented 9 years ago
Re-reading the information above, it also seems that if size is < 8 then it 
must be either 0 or 1, and anything else is invalid?

Original comment by sidnei.d...@gmail.com on 10 Apr 2012 at 11:30

GoogleCodeExporter commented 9 years ago
Right, that should be changed to raise an exception.

Original comment by reiter.christoph@gmail.com on 10 Apr 2012 at 11:49

GoogleCodeExporter commented 9 years ago
Here's an updated patch with the requested changes.

Original comment by sidnei.d...@gmail.com on 10 Apr 2012 at 12:01

Attachments:

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r115.

Original comment by reiter.christoph@gmail.com on 24 Jul 2012 at 11:53

GoogleCodeExporter commented 9 years ago
Changes: I removed the __slots__ again, since the creation of many objects 
shouldn't happen anymore and I don't want to risk breaking anyone’s code..

Original comment by reiter.christoph@gmail.com on 24 Jul 2012 at 12:01