netvl / immeta

Image metadata inspection library in Rust
MIT License

Implementing Exif Parsing #1

Open niclashoyer opened 9 years ago

niclashoyer commented 9 years ago

Hi,

I started to write some basic Exif parsing for JPEGs based on this description.

Basically, Exif is TIFF-encoded data inside the APP1 payload. Unfortunately, the format is quite messy. One problem that bothers me is that the value of an Exif tag is stored at an offset if it is larger than 4 bytes. This offset is relative to the start of the TIFF header (APP1 starting point + fixed offset).

I see two strategies here:

  1. Store all references to data and proceed until all tags are read. Then read all the referenced values (they are usually stored directly after the IFD). This would involve sorting the references, because they can be in any order and we need to read them in order.
  2. Read each referenced value directly after reading its reference. This involves seeking, so the LoadableMetadata::load method needs Seek. But that gives an error for load_from_buf, as Seek is not implemented for &[u8].

I prefer the second method, because it would make parsing a lot easier and doesn't need any sorting on references, but I don't know if it is possible to fix load_from_buf to be compatible.
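For comparison, the first strategy can be sketched without Seek at all. This is only an illustration, not code from immeta: `PendingValue` and `read_pending` are hypothetical names, and offsets are assumed to be relative to the TIFF header, as described above.

```rust
use std::io::{self, Read};

/// A value stored outside its IFD entry (hypothetical type for this sketch):
/// where it starts, relative to the TIFF header, and how many bytes it spans.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PendingValue {
    tag: u16,
    offset: u64,
    len: u64,
}

/// Reads every pending value from a forward-only reader.
/// `pos` is the reader's current position relative to the TIFF header.
fn read_pending<R: Read>(
    r: &mut R,
    mut pos: u64,
    mut pending: Vec<PendingValue>,
) -> io::Result<Vec<(u16, Vec<u8>)>> {
    // Sort by offset so the stream only ever has to move forward.
    pending.sort_by_key(|p| p.offset);

    let mut out = Vec::with_capacity(pending.len());
    for p in pending {
        if p.offset < pos {
            // A backward or overlapping reference cannot be served without Seek.
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "value offset points behind the current stream position",
            ));
        }
        // Discard the gap between the current position and the value.
        let gap = p.offset - pos;
        let skipped = io::copy(&mut r.by_ref().take(gap), &mut io::sink())?;
        if skipped != gap {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        // Read exactly `len` bytes of value data.
        let mut buf = Vec::new();
        let n = r.by_ref().take(p.len).read_to_end(&mut buf)?;
        if n as u64 != p.len {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        pos = p.offset + p.len;
        out.push((p.tag, buf));
    }
    Ok(out)
}
```

The sketch rejects values that overlap or point backwards, which is one of the cases that makes this strategy more awkward than seeking.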

Any ideas?

fisch42 commented 9 years ago

Maybe you could buffer the whole Exif payload. Is that feasible? It could also waste some memory.

netvl commented 9 years ago

I don't think that adding Seek to LoadableMetadata::load() is feasible because I intended for this library to be used in arbitrary contexts, for example, to read image metadata in a streaming mode from a socket. A Seek requirement is too strong. That said, to overcome the error in load_from_buf() you can use std::io::Cursor, which implements Seek.
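A minimal sketch of the Cursor workaround, assuming a generic helper that needs Seek. `read_at` and `read_at_in_slice` are hypothetical names for illustration; only `std::io::Cursor` and the std traits are real.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Reads `len` bytes at absolute position `offset` from any seekable reader.
fn read_at<R: Read + Seek>(r: &mut R, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    r.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// `&[u8]` implements `Read` but not `Seek`; wrapping it in a `Cursor`
/// provides the missing `Seek` implementation without copying the data.
fn read_at_in_slice(data: &[u8], offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut cursor = Cursor::new(data);
    read_at(&mut cursor, offset, len)
}
```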

So, I think that the first approach is better. Alternatively, as @fisch42 suggested, you can load the whole EXIF payload into memory. I think that this kind of buffering is okay; it is unlikely that EXIF metadata would take lots of memory. Then you can use direct offsets in a byte slice.
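The buffering approach could look roughly like the sketch below. `buffer_payload` and `value_at` are hypothetical helpers, and the payload length is assumed to be known already from the segment header. Because a JPEG segment length is a 16-bit field, a single APP1 payload stays under 64 KiB.

```rust
use std::io::{self, Read};

/// Reads the whole payload into memory from a forward-only reader.
fn buffer_payload<R: Read>(r: &mut R, payload_len: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; payload_len];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

/// Returns the `len` bytes starting at `offset` within the buffered payload,
/// or `None` if the range falls outside the buffer.
fn value_at(payload: &[u8], offset: u32, len: u32) -> Option<&[u8]> {
    let start = offset as usize;
    let end = start.checked_add(len as usize)?;
    payload.get(start..end)
}
```

Since exactly `payload_len` bytes are consumed, the reader is left at the end of the segment regardless of the order in which values are looked up.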

tdryer commented 8 years ago

I've implemented basic Exif support in my fork. The code probably isn't great (first thing I've written in Rust), but it might be a useful starting point.

My first approach for reading the tags was storing references, sorting them, and reading the data in order. I ended up dropping that and buffering the Exif data instead, since using Seek simplified the code, and I would otherwise have had to track how much data had been read in order to ensure the reader ends up at the end of the segment once parsing finishes.

Another annoying thing about the Exif format is that it can be either big or little endian. To deal with that, I had to add my own read_u16 and read_u32 methods that take the byte order as an argument.
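A sketch of what such helpers might look like (not the code from the fork; the `ByteOrder` enum and `byte_order_from_marker` are assumptions for illustration). The TIFF header starts with "II" for little-endian or "MM" for big-endian data.

```rust
use std::io::{self, Read};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ByteOrder {
    Big,
    Little,
}

/// Maps the two-byte marker at the start of a TIFF header to a byte order.
fn byte_order_from_marker(marker: [u8; 2]) -> Option<ByteOrder> {
    match &marker {
        b"II" => Some(ByteOrder::Little),
        b"MM" => Some(ByteOrder::Big),
        _ => None,
    }
}

/// Reads a `u16` in the given byte order.
fn read_u16<R: Read>(r: &mut R, order: ByteOrder) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    r.read_exact(&mut buf)?;
    Ok(match order {
        ByteOrder::Big => u16::from_be_bytes(buf),
        ByteOrder::Little => u16::from_le_bytes(buf),
    })
}

/// Reads a `u32` in the given byte order.
fn read_u32<R: Read>(r: &mut R, order: ByteOrder) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(match order {
        ByteOrder::Big => u32::from_be_bytes(buf),
        ByteOrder::Little => u32::from_le_bytes(buf),
    })
}
```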

niclashoyer commented 8 years ago

The implementation looks great so far! One thing I noticed while trying to implement it using buffering is that Exif data really is just TIFF, so conceptually it would be best if we had a TIFF metadata parser and used that to parse Exif. The biggest problem with TIFF is that image data can be anywhere in the file (just like Exif values), so unlike Exif, the TIFF data can get really large, and buffering is not an efficient option there.

netvl commented 8 years ago

Looks great, thanks! If you want, you can submit a pull request.

However, I agree with @niclashoyer in that I want to do things in a general way if possible; this means that implementing a TIFF parser is the best option.

Well, it seems I should think about how to integrate the ability to seek inside the image data, while not requiring a Seek implementation for those image formats which don't need it...

niclashoyer commented 8 years ago

@netvl just a thought: if you design optional seeking, keep in mind that it may be worth implementing both variants (seeking / non-seeking) and letting the user of the library decide. For example, if one wants to use immeta to parse TIFF files sent over the network, buffering the whole file is really bad and a slightly more complex implementation is "ok". But if one wants to use immeta to parse TIFF files from a hard disk, a seeking implementation is the best option, as it is still very fast.

netvl commented 8 years ago

Yes, that's something I was thinking of when I was writing that sentence, thanks!

netvl commented 8 years ago

I've added a read_from_seek<R: BufRead + Seek>() to LoadableMetadata and changed read() to require BufRead instead of just Read. This should make EXIF parsing implementation easier.
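One way to picture the two entry points is the sketch below. It is not immeta's actual definition: the `Metadata` trait, the `Toy` format, and the default-method fallback are assumptions made up for illustration, and only the method names mirror the comment above.

```rust
use std::io::{self, BufRead, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a metadata-loading trait.
trait Metadata: Sized {
    /// Streaming entry point: usable with sockets, pipes, and slices.
    fn read<R: BufRead>(r: &mut R) -> io::Result<Self>;

    /// Seeking entry point. Formats that never need to seek inherit this
    /// default, which simply falls back to the streaming path.
    fn read_from_seek<R: BufRead + Seek>(r: &mut R) -> io::Result<Self> {
        Self::read(r)
    }
}

/// Toy format invented for this sketch: byte 0 holds the absolute offset
/// (from the start of the stream) of a single value byte.
struct Toy {
    value: u8,
}

impl Metadata for Toy {
    fn read<R: BufRead>(r: &mut R) -> io::Result<Self> {
        let mut b = [0u8; 1];
        r.read_exact(&mut b)?;
        // One byte is already consumed, so skip forward `offset - 1` bytes.
        let skip = u64::from(b[0]).checked_sub(1).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidData, "offset points at the header")
        })?;
        let skipped = io::copy(&mut r.by_ref().take(skip), &mut io::sink())?;
        if skipped != skip {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        r.read_exact(&mut b)?;
        Ok(Toy { value: b[0] })
    }

    /// Assumes the reader starts at position 0 of the stream.
    fn read_from_seek<R: BufRead + Seek>(r: &mut R) -> io::Result<Self> {
        let mut b = [0u8; 1];
        r.read_exact(&mut b)?;
        // Jump straight to the value instead of reading through the gap.
        r.seek(SeekFrom::Start(u64::from(b[0])))?;
        r.read_exact(&mut b)?;
        Ok(Toy { value: b[0] })
    }
}
```

With this shape, a caller holding a file or a Cursor can pick the seeking path, while a caller holding a socket can still use the streaming one.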