Open exikyut opened 3 years ago
Hey, thank you for taking the time to dig into this. Following the discussion on HN and digging into mutool, here is a command that does exactly what you did in a split second:
mutool show -o meshes.u3d -b ./my-pdf-file.pdf 107
I also tested this with a dozen PDF files from different models, and it seems all of them use 107 as the object ID, so it would be safe to just use that for a while until the software is updated.
Ah, perfect, that's even less fiddling!
Testing with tons of different files seems to be one of the keys with PDFs. I couldn't possibly have had any confidence in the thought that the same object IDs would actually be used throughout different files, but there you go.
It was a fun project poking around; anytime.
Following [the HN discussion](), I was most curious what the 3D PDFs had in them, as I recently did some poking around in a large number of form-generated PDFs, and the exact specifics of my use case turned out to be sufficiently easy that my solution ended up reading the raw PDFs without using any frameworks. (Which let me process 2.5k PDFs/sec :D... but I digress.)
So, I first fired up PDF Vole to have a bit of a look around. The source for this is at https://github.com/Rossi1337/pdf_vole, but an old ZIP with a compiled JAR can be found at http://www.softsea.com/download/PDF-Vole.html (this is the actual "download will begin shortly" page).
The PDF format is not particularly hierarchical, and trying to reformat it into a tree structure does not produce particularly intuitive results. There wasn't much else to do except expand and collapse random nodes while trying not to get too bored/disillusioned.
But then I suddenly stumbled on this bit of interestingness:
Which kind of looks 𝓮𝒙𝓽𝒓𝓮𝒎𝓮𝒍𝔂 𝒑𝓻𝒐𝓶𝒊𝓼𝒊𝓷𝒈...
Wow, okay. But how on earth to dig it out?
PDFs aren't HTML. There's no declarative/predictable structuring system. The sample file seems to reference "meshes.u3d" in a complex hierarchy involving Objects 7, 115, 113, and 108, with the file data over in Object 107, but that's highly unlikely to stay stable per file.
Well, not much else to do except open the file in `less` and see how bad it is.

Let's search for the file length. Okay, here's the start of the raw stream, and the reference to Object 107:
And here's the filename:
Wait. The file reference is object 108, with the file itself as object 107. That's... probably not coincidental; I'd put a bit of money on the PDF generator library handling file embeds by inserting the stream object and following it with a FileSpec object immediately after.
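That adjacency guess can be checked per file rather than trusted. Here is a minimal Python sketch (not the approach taken in this thread) that looks up which object a FileSpec points at, so the stream's ID doesn't have to be hard-coded. It assumes the FileSpec carries an inline `/EF << /F n 0 R >>` dictionary, which is a layout the PDF spec allows for embedded files; if the real file routes this through an indirect object, the function simply returns `None`. The function name and the sample bytes are made up for illustration.

```python
import re

# Hypothetical sample mimicking the layout described above: the stream is
# object 107, and the FileSpec that names it is object 108.
SAMPLE = (
    b"107 0 obj\n<< /Length 4 >>\nstream\nU3D\x00\nendstream\nendobj\n"
    b"108 0 obj\n<< /Type /Filespec /F (meshes.u3d) /EF << /F 107 0 R >> >>\nendobj\n"
)


def embedded_object_number(raw, filename=b"meshes.u3d"):
    """Return the object number the FileSpec for `filename` references, or None.

    Assumption: the FileSpec has an inline /EF dictionary whose /F entry is
    an indirect reference ("n g R") to the embedded file stream.
    """
    # Crude split on "endobj" so each chunk holds roughly one object body.
    for chunk in raw.split(b"endobj"):
        if b"(" + filename + b")" not in chunk:
            continue
        match = re.search(rb"/EF\s*<<.*?/F\s+(\d+)\s+\d+\s+R", chunk, re.DOTALL)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    print(embedded_object_number(SAMPLE))  # 107
```

If this returns 107 for a given file, that is the number to hand to `mutool show`; if it returns something else (or `None`), the "stream right before FileSpec" bet needs another look for that file.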
While perhaps controversial, unintuitive, and having overtones of "this is absolutely the wrong approach and will blow up horribly in the future", the approach I ended up taking with the previous PDF extraction project I did was to take a lot of liberties with the exact way information was laid out in the output file, on the assumption that a) the low-level structure would either never change (down to newlines being put in the same places) or change so much that all my code would likely need to be thrown out the window due to assumptions I hadn't taken into account, and b) the amount of work required to figure out a high-level library would be greater than the amount of work required to "hack it", let all the abstractions leak, and take advantage of as many implementation details as happened to work in my favor.
Using the same approach here, considering this scenario, and deliberately encoding a bunch of specific assumptions (for instance, `stream` on a new line after the object reference)... it's possible to extract the data straight out of the PDF by simply skittering through most of the file line-at-a-time, reading in the full contents of any streams found, keeping track of the contents of the last seen stream, then waiting for a line denoting "meshes.u3d".
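As a rough illustration of that loop, here is a minimal Python sketch. It bakes in the same kind of assumptions: `/Length` is a direct integer on a line before `stream`, `stream` sits on a line of its own, and the line naming "meshes.u3d" comes after the stream it belongs to. The stream bytes are returned exactly as stored, with no filter decoding. The function name and the demo bytes are hypothetical.

```python
import io
import re

# Direct integer lengths only; "/Length 12 0 R" (an indirect reference) is skipped.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")


def extract_last_stream_before(pdf, marker=b"meshes.u3d"):
    """Return the raw bytes of the last stream seen before a line containing
    `marker`, or None if no such line follows a stream.

    `pdf` is a binary file-like object.
    """
    last_stream = None
    pending_length = None
    while True:
        line = pdf.readline()
        if not line:
            return None  # hit end of file without seeing the marker
        match = LENGTH_RE.search(line)
        if match:
            pending_length = int(match.group(1))
        if line.strip() == b"stream" and pending_length is not None:
            # Read exactly /Length bytes so binary data containing newlines
            # is never mistaken for PDF structure.
            last_stream = pdf.read(pending_length)
            pending_length = None
        elif marker in line and last_stream is not None:
            return last_stream


if __name__ == "__main__":
    payload = b"U3D\x00\nbinary"
    demo = io.BytesIO(
        b"%PDF-1.6\n107 0 obj\n<< /Length 11 >>\nstream\n"
        + payload
        + b"\nendstream\nendobj\n"
        + b"108 0 obj\n<< /Type /Filespec /F (meshes.u3d) >>\nendobj\n"
    )
    assert extract_last_stream_before(demo) == payload
```

On a real file this would be called as `extract_last_stream_before(open(path, "rb"))`, with the result written out to `meshes.u3d`.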
Or, in PHP, which I tend to turn to when others use Python because it's faster (okay, okay, because I haven't learned Python yet):
Happily, the above doesn't use any particularly exotic PHP constructs, and its length readily lends itself to straightforward reimplementation in another language. I'm not sure about JavaScript - I may or may not have given up on my attempt the moment I remembered that Node doesn't have a straightforward batteries-included `fgets` - but I wouldn't be surprised if this can be done in OCaml in potentially even roughly a similar number of lines. (And it's probably even amenable to being structured in a functional way... errr, until the export format unexpectedly changes after a software upgrade and everything has to get thrown out the window. Heh.)

One extremely interesting and surprising sidenote that definitely resulted in a bit of an eyebrows-raised pause and a few blinks was this:
Seems like the software came from a competent environment that knows how to spend money on good engineers :)