wader / fq

jq for binary formats - tool, language and decoders for working with binary and text formats

jpeg: Add parsing of DHT parameters #934

Closed matmat closed 2 months ago

matmat commented 2 months ago

This is my try at adding parsing of Huffman table parameters for the DHT segment in JPEG files. Feel free to clean it up, as I do not speak Go very well :)
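
For readers unfamiliar with the DHT segment, here is a minimal Python sketch of the parameters involved. It is not the Go code from this PR. It assumes the layout given in the JPEG standard (ITU-T T.81, section B.2.4.2): after the 2-byte segment length, each table has one byte holding Tc (high nibble) and Th (low nibble), 16 code-length counts L1..L16, and then the symbol values. The function name and dictionary keys are made up for illustration and are not fq's field names.

```python
def parse_dht_payload(payload):
    """Parse the bytes that follow the 2-byte length field of a DHT segment.

    Returns one dict per Huffman table found in the payload.
    """
    tables = []
    pos = 0
    while pos < len(payload):
        # Each table starts with 1 byte (Tc/Th) plus 16 count bytes.
        if len(payload) - pos < 17:
            raise ValueError("truncated DHT table header")
        tc_th = payload[pos]
        counts = list(payload[pos + 1 : pos + 17])
        pos += 17

        # The counts say how many symbol values follow.
        total = sum(counts)
        if len(payload) - pos < total:
            raise ValueError("truncated DHT symbol values")
        values = list(payload[pos : pos + total])
        pos += total

        tables.append(
            {
                "table_class": tc_th >> 4,  # Tc: 0 = DC, 1 = AC
                "table_id": tc_th & 0x0F,   # Th: table destination identifier
                "code_counts": counts,      # L1..L16: number of codes per bit length
                "values": values,           # V(i,j): symbols in code order
            }
        )
    return tables
```

A single DHT segment may carry several tables back to back, which is why the sketch loops until the payload is consumed.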

wader commented 2 months ago

Hey, thanks! looks good i think. Do you know if the dht tables are large or usually small? looks small when i tried it on a few images.

Please run go test ./format ./pkg/interp -update (run without -update to just see the diff) to write new expected test output, review the changes, and amend the commit if it looks good.

If you want you could also add a new test file, the dht in 4x4.jpeg looks quite simple, maybe want something more realistic?

wader commented 2 months ago

Hi again, i fmt:ed the code and updated the tests

wader commented 2 months ago

@matmat Thanks!

matmat commented 2 months ago

Thank you for merging and cleaning it up! Sorry for not coming back sooner, unfortunately I did not have the time. As to your question about the length: according to [1], "The maximum number of DCT byte codes possible in the baseline JPEG format is 348", though they observed a maximum of 277 in the datasets they looked at.

  1. https://commons.erau.edu/jdfsl/vol13/iss2/7/

Would you accept a similar PR for missing parameters for other markers? (eg. "Ri" for DRI)

wader commented 2 months ago

> Thank you for merging and cleaning it up! Sorry for not coming back sooner, unfortunately I did not have the time. As to your question about the length: according to [1], "The maximum number of DCT byte codes possible in the baseline JPEG format is 348", though they observed a maximum of 277 in the datasets they looked at.

Good 👍 mostly worried if something can decode into millions of fields then maybe decoding of that should be made optional using a format option.

> 1. https://commons.erau.edu/jdfsl/vol13/iss2/7/
>
> Would you accept a similar PR for missing parameters for other markers? (eg. "Ri" for DRI)

Sure! will accept anything that is either in standards or used in public. The whole point of fq is to decode as detailed as possible, except maybe decode to actual pixels (maybe that also in some cases) so i'm very happy if you want to help fill in missing things! 😄
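
As an illustration of the "Ri" parameter mentioned above, here is a small Python sketch. It assumes the DRI layout from the JPEG standard (ITU-T T.81, section B.2.4.4): a 2-byte length that is always 4, followed by a 16-bit big-endian restart interval counted in MCUs. The function name is hypothetical and this is not fq code.

```python
def parse_dri_payload(payload):
    """Return Ri, the restart interval in MCUs, from a DRI segment payload.

    `payload` is the data after the 2-byte length field, so it must be
    exactly 2 bytes long.
    """
    if len(payload) != 2:
        raise ValueError(f"DRI payload must be 2 bytes, got {len(payload)}")
    return int.from_bytes(payload, "big")
```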

wader commented 1 week ago

@matmat just noticed https://www.diva-portal.org/smash/get/diva2:1870437/FULLTEXT02.pdf congratulations! 🥳 have only briefly scrolled thru it yet but will surely have a deeper look! how was it to use fq? is there any more info how it was used?

matmat commented 1 week ago

@wader Thank you! :) We mainly used fq to extract the marker segments and their parameters as an intermediate step towards transforming the data to tabular form suitable for ML processing. This sure saved us a lot of time! fq already supporting extracting this information in a structured way was very very helpful. So many thanks for a useful tool!

I have now documented some details here (all very hacky): https://github.com/matmat/jpeg_encoder_ml_classification/

I guess maybe the first three steps are the most relevant from an fq perspective:

1) jpmarkers2.py is a custom script that always removes image data from a jpeg (the ECS "segment"), along with the marker segments specified with -r. This is because we are not interested in the image data and want smaller files to work with in the next steps.

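# Marker segments to drop in addition to the image data (ECS).
# Built as one comma-separated string so -r receives a single argument.
markers="APP1,APP2,APP3,APP4,APP5,APP6,APP7,APP8,APP9,APP10"
markers="$markers,APP11,APP12,APP13,APP14,APP15"
markers="$markers,RST0,RST1,RST2,RST3,RST4,RST5,RST6,RST7"
for f in *.jpg; do
    jpmarkers2.py -r "$markers" -i "$f" -o "cleaned_$f"
done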
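
To make step 1 more concrete, here is a rough Python sketch of the idea. It is not the actual jpmarkers2.py. It walks the marker segments, copies those that are not in a removal set, and skips the entropy-coded data after each SOS header. All function and variable names are invented for illustration. One simplification to note: RSTn markers inside the scan are dropped together with the image data instead of being handled through the removal list.

```python
# Markers that carry no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01} | set(range(0xD0, 0xDA))
SOS, EOI = 0xDA, 0xD9


def skip_entropy_coded(data, pos):
    """Return the offset of the first real marker at or after `pos`.

    Stuffed bytes (FF 00), restart markers (FF D0..D7) and fill bytes
    (FF FF) are treated as part of the entropy-coded data.
    """
    while True:
        pos = data.find(b"\xff", pos)
        if pos < 0 or pos + 1 >= len(data):
            return len(data)
        nxt = data[pos + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos += 2
        elif nxt == 0xFF:
            pos += 1
        else:
            return pos


def strip_jpeg(data, remove=()):
    """Copy a JPEG without entropy-coded data and without `remove` markers.

    `remove` holds marker codes, e.g. {0xE1} for APP1.
    """
    remove = set(remove)
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected marker at offset {pos}")
        # Skip optional FF fill bytes in front of the marker code.
        while pos + 1 < len(data) and data[pos + 1] == 0xFF:
            pos += 1
        if pos + 1 >= len(data):
            raise ValueError("truncated marker")
        code = data[pos + 1]
        pos += 2
        if code in STANDALONE:
            if code not in remove:
                out += bytes([0xFF, code])
            if code == EOI:
                break
            continue
        # The length field counts itself (2 bytes) plus the payload.
        length = int.from_bytes(data[pos : pos + 2], "big")
        if length < 2 or pos + length > len(data):
            raise ValueError(f"bad segment length at offset {pos}")
        if code not in remove:
            out += bytes([0xFF, code]) + data[pos : pos + length]
        pos += length
        if code == SOS:
            pos = skip_entropy_coded(data, pos)
    return bytes(out)
```

Under these assumptions, something like strip_jpeg(open(path, "rb").read(), remove={0xE1}) would return the file without APP1 and without image data.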

2) Extract features with fq and pipe through jq for pretty printing.

for f in cleaned_*.jpg; do
    fq -r '.|tojson' "$f" | jq . > "$(basename -s .jpg "$f").json"
done

3) Transform the json output from fq to tsv and also do some slight post-processing like concatenating qtables to hexstrings among other small things.

mkdir -p tsv
for f in *.json; do
    # Strip .json (not .jpg): the inputs here are the JSON files from step 2.
    transform.py < "$f" > "tsv/$(basename -s .json "$f").tsv"
done
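
One generic way to turn nested JSON into TSV rows, as in step 3, is sketched below in Python. This is not the actual transform.py, and it leaves out the qtable post-processing. It flattens every leaf value into a path/value pair. The function names and the path notation are assumptions made for illustration, and the sketch does not depend on fq's actual JSON field names.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def json_to_tsv(text):
    """Convert a JSON document into 'path<TAB>value' lines."""
    rows = []
    for path, value in flatten(json.loads(text)):
        # Strings are kept as-is; other scalars use their JSON spelling.
        cell = value if isinstance(value, str) else json.dumps(value)
        # Keep one record per line by escaping tabs and newlines.
        cell = cell.replace("\t", "\\t").replace("\n", "\\n")
        rows.append(f"{path}\t{cell}")
    return "\n".join(rows)
```

A path/value layout keeps the sketch independent of the input schema. A real feature table for ML would more likely pivot selected paths into one row per file.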

matmat commented 1 week ago

@wader Also, while working on this, I stumbled on the concept of Interval Parsing Grammars. This would be very interesting to explore further to make robust parsers for binary file formats. But maybe that is a bit out of scope for fq?

wader commented 1 week ago

Great to hear it was useful! this kind of usage is the reason why fq exists to begin with :) was first created to query media files in various exotic ways while developing and debugging codec and packaging software. ... but also just a way for me to learn more about media files :)

BTW, instead of fq -r '.|tojson' $f | jq . you can probably do fq tovalue $f, or the same thing using -V: fq -V . $f (tovalue converts the decode tree into a jq value, which then gets output as JSON)

Nope haven't heard of IPG before, looks very interesting, thanks for sharing! something like that is very much in scope for fq. I've been exploring various ways to do "runtime" formats for fq but nothing finished yet. There is a WIP prototype to add kaitai support, and i see it's mentioned in the paper, looks a bit similar. Usage would then be something like fq -d /path/to.ksy <query> file