veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

Couldn't parse stream #1386

Closed marcosdsdba closed 7 months ago

marcosdsdba commented 9 months ago

I am trying to check whether a PDF is PDF/A compliant. I am using Java 1.8, Maven, and the veraPDF validation-model (org.verapdf/validation-model), versions [1.24.1 and 1.14.105].
The PDF data is held in a variable (it starts with %PDF-1.4...), and I convert it to an InputStream to create the parser/loader:

PDFAParser loader = Foundries.defaultInstance().createParser(inputStream, PDFAFlavour.PDFA_1_B);

This returns the error:

org.verapdf.core.ModelParsingException: Couldn't parse stream
         at org.verapdf.gf.model.GFModelParser.createModelWithFlavour(GFModelParser.java:128)
         at org.verapdf.gf.model.GFModelParser.createModelWithFlavour(GFModelParser.java:112)
         at org.verapdf.gf.model.GFModelParser.createModelWithFlavour(GFModelParser.java:107)
         at org.verapdf.gf.foundry.VeraFoundry.createParser(VeraFoundry.java:75)

How do I validate a PDF that is held in a variable? Does veraPDF only accept PDFs that are stored as files? Is it not possible to pass the raw PDF data from a variable?

Context:

I receive the PDF from an API, Base64-encoded inside an XML document. I decode the Base64, and if I write the result out with Java file output I can create a PDF file and validate it for PDF/A compliance; so far, so good. However, I need to confirm that the data is a valid PDF/A first and only then create the file. Is it possible with veraPDF to validate a PDF that is held in a variable, that is, the variable I receive from the API? (I have used veraPDF to validate the generated file, but I would like to create the file only after it has passed PDF/A validation.)

Thanks!

reftel commented 9 months ago

It is possible to pass an InputStream that is not backed by a file. The problem here is something else. I would suggest inspecting what's actually in the stream, to see whether the contents look like the start of a PDF file, or whether some part of the processing (e.g. Base64 decoding) has been omitted.
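
One way to do that inspection is to look at the first bytes of the data before handing it to veraPDF. The following is only a minimal sketch using the Java standard library; the class and method names (`PdfHeaderCheck`, `looksLikePdf`, `decodeBase64`) are made up for illustration and are not part of veraPDF. It assumes the Base64 text has already been extracted from the XML response.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of veraPDF: a quick sanity check on the raw bytes.
final class PdfHeaderCheck {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Decodes Base64 text; the MIME decoder tolerates line breaks inside the payload. */
    static byte[] decodeBase64(String base64Text) {
        return Base64.getMimeDecoder().decode(base64Text.trim());
    }

    public static void main(String[] args) {
        // Stand-in for the Base64 string received from the API.
        String encoded = Base64.getEncoder()
                .encodeToString("%PDF-1.4\n...".getBytes(StandardCharsets.US_ASCII));

        byte[] stillEncoded = encoded.getBytes(StandardCharsets.US_ASCII);
        byte[] decoded = decodeBase64(encoded);

        System.out.println(looksLikePdf(stillEncoded)); // false: decoding step was skipped
        System.out.println(looksLikePdf(decoded));      // true
    }
}
```

If the check fails, the bytes being wrapped in the InputStream are probably still Base64 text (Base64-encoded PDF data typically starts with `JVBERi0`) or were altered along the way, for example by a String conversion with the wrong charset.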

bdoubrov commented 9 months ago

Indeed, veraPDF supports validation of PDF files directly from an InputStream. Because parsing a PDF requires random access, veraPDF internally reads the data from the InputStream and, depending on its size, turns it into either an in-memory byte array or a file-based InputStream.

The exception call stack essentially means that the data from the input stream does not look like a valid PDF and cannot be parsed.
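
For the "validate first, create the file afterwards" flow asked about in the original question, a rough sketch could look like the one below. It deliberately keeps the veraPDF calls behind a hypothetical `StreamValidator` interface (an assumption made for this example, not a veraPDF type), so that only standard library calls appear; in real code the implementation of that interface would create the parser from the stream and run the validation.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of veraPDF: write the file only after validation passes.
final class ValidateThenWrite {

    /** Hypothetical hook; a real implementation would wrap the veraPDF parser and validator. */
    interface StreamValidator {
        boolean isCompliant(InputStream in) throws Exception;
    }

    /** Validates the in-memory bytes and writes them to target only if they are compliant. */
    static boolean writeIfCompliant(byte[] pdfBytes, Path target, StreamValidator validator)
            throws Exception {
        boolean compliant;
        // The stream is backed by the byte array, so no file exists at this point.
        try (InputStream in = new ByteArrayInputStream(pdfBytes)) {
            compliant = validator.isCompliant(in);
        }
        if (compliant) {
            Files.write(target, pdfBytes);
        }
        return compliant;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "%PDF-1.4\n...".getBytes("US-ASCII");
        Path target = Files.createTempDirectory("pdfa-demo").resolve("out.pdf");

        // Placeholder validator that rejects everything, standing in for the real check.
        boolean written = writeIfCompliant(data, target, in -> false);
        System.out.println("written: " + written + ", file exists: " + Files.exists(target));
    }
}
```

Keeping the decoded bytes in a byte array means the same data can be validated and then written without reading the stream twice.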