sylikc / pyexiftool

PyExifTool (active PyPI project) - A Python library to communicate with an instance of Phil Harvey's ExifTool command-line application. Runs one process with special -stay_open flag, and pipes data to/from. Much more efficient than running a subprocess for each command!

Reading a binary tag #47

Closed kaufManu closed 2 years ago

kaufManu commented 2 years ago

I need to read a tag that stores 100 bytes of binary data with a custom format that I have to parse myself. Without the -b option I'm getting the string (Binary data 100 bytes, use -b option to extract). I then use

tag = et.get_tags(["image.dng"], [TAG_NAME], ['-b'])[0]

This returns a dictionary with the name of the tag and the value of that tag as a string. However, I'd like to get the raw binary data so that I can parse it according to some external specifications. Is that possible? I think that on this line the binary data would be available, but it is automatically decoded to a string again. But additionally this line does not "just" return the value of the tag I'm interested in, but more information like SourceFile or the name of the Tag again.

Long story short: How can I get the raw binary data stored in a tag?

Edit: I forgot to mention that if I do this on the command line

exiftool.exe -TAG_NAME -b image.dng > data.dat

The file data.dat contains the binary data that I would expect.

sylikc commented 2 years ago

that's an interesting problem... I guess I never thought about that use case when I changed the ExifTool base class with the encoding parameter. https://github.com/sylikc/pyexiftool/blob/master/exiftool/exiftool.py#L739

One of the biggest problems up till that point was having mismatched tag encodings... but I didn't account for binary data. Off the top of my head, you could use the -w flag to write it to a file. But if you need it piped, I'd have to look at the code and think about it. It would probably take a fundamental change to revert it back to working with bytes... or a flag to output bytes.

sylikc commented 2 years ago

I guess I could end up writing a separate method that works only in bytes, like the v0.4.x way... but with the synchronization flags.

I made that change to go from bytes to string to fix this internationalization issue https://github.com/sylikc/pyexiftool/issues/29 ... would -w work or did you want to pipe that to bytes?

kaufManu commented 2 years ago

Thanks for the quick reply. Dumping it to a file would be fine for me - if that also supports batched processing?

I tested it out, but I don't know exactly how to specify the command in the get_tags function. On the commandline, this works:

exiftool.exe -TAG_NAME -b -w cmd.dat image.dng

which produces a file imagecmd.dat with the expected content. I've tried:

et.get_tags([".\\image.dng"], [TAG_NAME], ['-b', '-w', 'py.dat'])

This raises an Exception

  File "site-packages\exiftool\helper.py", line 347, in get_tags
    ret = self.execute_json(*exec_params)
  File "site-packages\exiftool\exiftool.py", line 1030, in execute_json
    result = self.execute("-j", *params)  # stdout
  File "site-packages\exiftool\helper.py", line 119, in execute
    raise ExifToolExecuteError(self._last_status, self._last_stdout, self._last_stderr, params)
exiftool.exceptions.ExifToolExecuteError: execute returned a non-zero exit status: 1

But it does create the file imagepy.dat. But the content of that file is still not just the binary data, it's again the dictionary with the additional SourceFile etc tags. I've also tried to specify the parameters as ['-b -w py.dat'] but that does not create the file in the first place.
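
A likely reason the last attempt creates no file: each list element is handed to exiftool as one argument, so '-b -w py.dat' would arrive as a single unrecognized option instead of three separate ones. As a minimal sketch (standard library only, with TAG_NAME as a placeholder), a command line that works in the shell can be split into one element per argument like this:

```python
import shlex

# Placeholder command line; "TAG_NAME" stands in for the real tag name.
command_line = "-TAG_NAME -b -w py.dat image.dng"

# One list element per argument, split the way a POSIX shell would split it.
args = shlex.split(command_line)
print(args)
```

Note that shlex.split treats backslashes as escape characters, so Windows paths are safer written with forward slashes or appended to the list by hand.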

sylikc commented 2 years ago

> Thanks for the quick reply. Dumping it to a file would be fine for me - if that also supports batched processing?

Yes, it has some special features for batch processing actually. Search PH's ExifTool Documentation for -textOut and read the documentation on how exiftool uses the -w flag.

> I tested it out, but I don't know exactly how to specify the command in the get_tags function. On the commandline, this works:
>
> exiftool.exe -TAG_NAME -b -w cmd.dat image.dng
>
> which produces a file imagecmd.dat with the expected content. I've tried:
>
> et.get_tags([".\\image.dng"], [TAG_NAME], ['-b', '-w', 'py.dat'])

Ok, so this was a robustness change in v0.5.x. It raises an error because ExifToolHelper.get_tags supposedly always returns JSON. So you can't really use get_tags with -w.

Although... I would have expected a different error thrown. See the specific "Note" box at ExifTool.execute_json -w behavior

As per that note, the proper way to use -w is using the execute() method (can be used in ExifToolHelper). It's a little more manual, but it would be run just like it is on the command line

exiftool.exe -TAG_NAME -b -w cmd.dat image.dng

becomes

et.execute(*[f"-{TAG_NAME}", "-b", "-w", "cmd.dat", "image.dng"])

kaufManu commented 2 years ago

Awesome, the execute method does exactly what I want. For reference and other readers: I'm processing multiple images simply via

et.execute(*[f"-{TAG_NAME}", "-b", "-w!", "cmd.dat", "image1.dng", "image2.dng"])

I've added the ! to -w to override existing output files.
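
For other readers, a minimal sketch of reading those output files back into Python. It assumes the naming observed earlier in this thread (image.dng with -w cmd.dat produces imagecmd.dat next to the source); the helper names expected_output_path and read_written_tags are made up for illustration and are not part of pyexiftool:

```python
from pathlib import Path


def expected_output_path(source, w_arg="cmd.dat"):
    """Guess where `-w <w_arg>` wrote the output for `source`.

    Assumption taken from this thread: image.dng + "-w cmd.dat" -> imagecmd.dat,
    i.e. the source extension is dropped and w_arg is appended to the stem.
    """
    src = Path(source)
    return src.with_name(src.stem + w_arg)


def read_written_tags(sources, w_arg="cmd.dat"):
    """Return {source: bytes} for every expected output file that exists."""
    result = {}
    for source in sources:
        out = expected_output_path(source, w_arg)
        if out.is_file():
            result[source] = out.read_bytes()
    return result
```

After the et.execute(...) call above has finished, read_written_tags(["image1.dng", "image2.dng"]) would return the raw bytes per source file. If a different -w format is used (for example one with %d or %f codes), the path guess needs adjusting.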

Thank you for your quick help and your work for providing this package and keeping it so well maintained!

sylikc commented 2 years ago

You're welcome!

I'll think about adding a method like execute_bytes to do the piped output... it's certainly an interesting problem that I didn't consider when doing the encoding string change to fix internationalization issues...

kaufManu commented 2 years ago

Yeah that would definitely be helpful to avoid having to go over the disk to get to the data.

Let me know if I should test anything in the future.

sylikc commented 2 years ago

I'll think about it a bit more. I will have a chance to think about the design later next week...

Probably will have something for you to test with if I end up implementing it (leaning towards it)

I really didn't consider that binary use case before. Binary maker notes data has always looked like junk to me lol

sylikc commented 2 years ago

So I was just doing some testing, and I find that I can in fact use get_tags to get some binary tag... I just get some string that says 'MakerNotes:PreviewImage': 'base64: ......'

with ExifToolHelper() as et:
    print(et.get_tags("image.jpg", "MakerNotes:PreviewImage", params="-b"))

Is that what you're getting? You'd then be able to decode that directly

kaufManu commented 2 years ago

Yes, this at least gives access directly to the value of that tag, but since it's a string I don't know how to interpret it. I think the problem is that the execute function automatically decodes the bytes to a string (let's say with utf-8) instead of just returning the raw bytes. Simply encoding that string again to bytes does not work because the original bytes from the tag were not meant to be interpreted as utf-8 in the first place.
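
The round-trip problem described here can be reproduced without exiftool. The sketch below uses a made-up 100-byte payload and assumes a lossy decode (errors="replace"); whether the library decodes exactly this way is an assumption, but a strict decode would fail outright with UnicodeDecodeError on the same data:

```python
# Hypothetical 100-byte payload; these byte values do not form valid UTF-8.
raw = bytes(range(156, 256))

# Decoding with replacement substitutes U+FFFD for the invalid sequences ...
text = raw.decode("utf-8", errors="replace")

# ... so encoding the string again cannot reproduce the original bytes.
round_tripped = text.encode("utf-8")

print(len(raw), len(round_tripped), round_tripped == raw)
```

Once the replacement has happened, the original byte values are gone, which is why the data has to be fetched as bytes (or as base64 text) in the first place.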

sylikc commented 2 years ago

So, the string that gets returned is a base64 encoded string that comes from the JSON encoding spec.

I looked into the code... just adding an execute_bytes method isn't enough to fix this. The fundamental change made in commit 137c0e2b957dc499b3df41d7eee1dc5355957978, which moved all the calls from a bytes implementation to a string implementation, actually makes it difficult to revert or support both at once.

The Popen()'s encoding parameter along with the encoding/decoding inside execute() ... I'm not sure how I would support bytes and string in the same class.

sylikc commented 2 years ago

Ah, with investigation, I might be able to change this after all... the Popen encoding is actually not used for the I/O to the process...

The only communication with the process in text is the stdin write. The reads are raw, unbuffered reads. I might need to create a branch and test these changes before making them live.
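
A generic illustration of that split (text going in, raw bytes coming out), using only the standard library. The child process here is a stand-in Python one-liner, not exiftool, and none of this is pyexiftool's actual code; ask_child is a hypothetical name:

```python
import subprocess
import sys

# Stand-in child process (NOT exiftool): reads one hex-encoded line on stdin
# and writes the decoded bytes to stdout without any text encoding.
CHILD = (
    "import sys; "
    "sys.stdout.buffer.write(bytes.fromhex(sys.stdin.readline().strip()))"
)


def ask_child(payload: bytes) -> bytes:
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,  # no encoding= or text=True, so the pipes carry bytes
    )
    # The command is text, so it is encoded once on the way in.
    command = (payload.hex() + "\n").encode("utf-8")
    out, _ = proc.communicate(command)
    # The reply is returned untouched: there is no decode step that could mangle it.
    return out


reply = ask_child(b"\x00\xff\x80binary")
print(reply)
```

Because nothing decodes the reply, the caller can choose per call whether to decode it to a string or keep the bytes as they are.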

Would you be able to share a file with me and possibly a code snippet so I can test this against some useful binary data?

kaufManu commented 2 years ago

I need to double check whether I can share an example image - I'll be back!

sylikc commented 2 years ago

> I need to double check whether I can share an example image - I'll be back!

Ok, well it's not necessary anymore. I wrote a test using a custom tag, and it looks like I can read/write binary tag without an issue. https://github.com/sylikc/pyexiftool/pull/48/commits/60d793f78277816fd73051e388dc9f456ca5ad45

I will do a bit more testing before merging... Need to write a few more tests, but this should address your issue.

If you get a chance @kaufManu check out the PR and test to see if it works for your use case. I've been a bit busy recently, but I'll merge this in after more rigorous testing.

kaufManu commented 2 years ago

@sylikc I finally got around to testing the PR - apologies for my late reply! I've tested it like this:

data = et.execute("-b", f"-{TAG_NAME}", example_dng, raw_bytes=True)

and it seems to work like a charm - thank you for the change, it makes my life quite a bit easier :) ! Can this execute function also handle batched input (i.e. obtaining the same tag for multiple DNGs at the same time)? Not a big deal if not, I'm just wondering.

sylikc commented 2 years ago

> Can this execute function also handle batched input (i.e. obtaining the same tag for multiple DNGs at the same time)?

So, it appears you can do it on the command line... but it comes out a mess. (it's just concatenated together) I can try adding an ExifToolAlpha function which may do it... I would hit up the tag once to get the amount of bytes, then parse it afterwards. Let me think about it.
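
Until something like that exists, one simple way around the concatenation is to ask for one file at a time over the same running process. This is only a sketch: read_binary_tag is a hypothetical helper, and it assumes execute() accepts raw_bytes=True exactly as in the call tested above:

```python
def read_binary_tag(et, tag_name, paths):
    """Return {path: bytes} by requesting the tag from one file per call.

    `et` is expected to be a running ExifToolHelper whose execute()
    accepts raw_bytes=True (as in the PR tested in this thread).
    """
    result = {}
    for path in paths:
        # One call per file keeps each payload separate instead of concatenated.
        result[path] = et.execute("-b", f"-{tag_name}", path, raw_bytes=True)
    return result


# Usage (needs exiftool and pyexiftool installed):
# from exiftool import ExifToolHelper
# with ExifToolHelper() as et:
#     data = read_binary_tag(et, "TAG_NAME", ["image1.dng", "image2.dng"])
```

The exiftool process stays open between calls, so the per-file loop costs one round trip per image rather than one process start per image.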

sylikc commented 2 years ago

@kaufManu I think based on the 4/11 comment, using something similar ...

import base64
from exiftool import ExifToolHelper

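def base64_recurse(d):
    """Decode every "base64:"-prefixed string in a (possibly nested) dict to bytes, in place."""
    for k, v in d.items():
        if isinstance(v, dict):
            base64_recurse(v)  # descend into the groups produced by -g
        elif isinstance(v, str) and v.startswith("base64:"):
            d[k] = base64.b64decode(v[7:])  # drop the 7-character "base64:" prefix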

with ExifToolHelper(common_args=['-n', '-g']) as et:
    t = et.get_tags("*.jpg", "ThumbnailImage", params="-b")
    for x in t:
        base64_recurse(x)
    print(t)

might be a better way. The concatenated mess really would be hard to parse, especially if you try to figure out what tag came from what file... or integrating tags... I tried running exiftool -config files\my_makernotes.config -j -MyMakerNotes -ImageSize -b *.jpg in some test case and it would just get really messy really fast... as it's not easy to tell which tag came from what file... and such.

Note: I used the "-g" option in common_args just to get nested output and show that base64_recurse works across nested dicts. It's optional. I've verified this works to get the binary, though it's slightly less efficient than using the raw binary execute() because exiftool has to encode into base64 and pass more data through the pipe. For your use case, where the data is small, it might be worthwhile to do it this way.

kaufManu commented 2 years ago

I see - thanks for the additional information! The batched version is not that important, so the current solution works just fine for me! Feel free to close this issue whenever you like.

Thanks again for your work!

sylikc commented 2 years ago

Fixed with v0.5.4