radkovo / Pdf2Dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
http://cssbox.sourceforge.net/pdf2dom/
GNU Lesser General Public License v3.0

xpath of dom element #47

Closed ashishsharma0 closed 3 years ago

ashishsharma0 commented 3 years ago

Hi,

Could you please let me know how to find an element in the DOM using XPath? Please share an example or code snippet that gets the DOM tree and traverses it as needed. I want to verify that the colour of an element is as expected.

pdf = PDDocument.load(pdfFile);
PDFDomTree parser = new PDFDomTree();
// parse the file and get the DOM Document
Document dom = parser.createDOM(pdf);
System.out.println("dom.getTextContent()"+dom.getTextContent());
System.out.println("dom.getDocumentElement()"+dom.getDocumentElement());
System.out.println(dom.getElementById("A300-327-GE"));

radkovo commented 3 years ago

Pdf2DOM produces a standard DOM, so the standard Java XPath API should work. For example, something like this:

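import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
// parse the PDF file and get the DOM Document
PDDocument pdf = PDDocument.load(pdfFile);
PDFDomTree parser = new PDFDomTree();
Document dom = parser.createDOM(pdf);
// XPath-related part starts here
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "/html//*[@id='A300-327-GE']"; // use your XPath expression here
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(dom, XPathConstants.NODESET);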

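See, for example, this tutorial (https://www.baeldung.com/java-xpath) or this one (https://www.journaldev.com/1194/java-xpath-example-tutorial) for more info about XPath in Java.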
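To tie this back to the colour check in the question, here is a minimal, self-contained sketch. It assumes that an element's colour appears as a `color` declaration in an inline `style` attribute; that assumption, the sample markup, and the names `ColorCheck`, `sampleDom`, `cssProperty`, and `colorOfText` are illustrative only and not part of Pdf2DOM. The hand-written document stands in for the result of `parser.createDOM(pdf)` so the code runs without a PDF file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ColorCheck {
    // Hypothetical stand-in for the Document returned by parser.createDOM(pdf).
    // The real markup and style format may differ; inspect your own output first.
    static Document sampleDom() {
        String xml = "<html><body>"
                + "<div class=\"p\" style=\"top:10pt;left:20pt;color:#ff0000;\">A300-327-GE</div>"
                + "<div class=\"p\" style=\"top:30pt;left:20pt;color:#000000;\">Other text</div>"
                + "</body></html>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample DOM", e);
        }
    }

    // Returns the value of one property from an inline style string, or null if absent.
    static String cssProperty(String style, String name) {
        for (String declaration : style.split(";")) {
            String[] parts = declaration.split(":", 2);
            if (parts.length == 2 && parts[0].trim().equalsIgnoreCase(name)) {
                return parts[1].trim();
            }
        }
        return null;
    }

    // Finds the first element with a text node equal to `text` and returns its
    // `color` value, or null if nothing matches or no colour is declared.
    static String colorOfText(Document dom, String text) {
        // Plain concatenation: assumes `text` contains no single quote.
        String expression = "//*[text()='" + text + "']";
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            Element match = (Element) xPath.evaluate(expression, dom, XPathConstants.NODE);
            return match == null ? null : cssProperty(match.getAttribute("style"), "color");
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("color = " + colorOfText(sampleDom(), "A300-327-GE"));
    }
}
```

With a real PDF, you would pass the `dom` obtained from `createDOM` instead of `sampleDom()`, after checking how colours are actually represented in that output.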

ashishsharma0 commented 3 years ago

Please advise further. I got the output below:

expression//*[text()='A300-327-GE']
nodeListnull
nodeList[div: null]

I tried the following:

String expression = "//*[text()='A300-327-GE']"; // use your XPath expression here
System.out.println("expression"+expression);
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(dom, XPathConstants.NODESET);
System.out.println("nodeList"+nodeList.item(1));
System.out.println("nodeList"+nodeList.item(0));

radkovo commented 3 years ago

It seems that your nodeList.item(0) returns a [div], so the query basically works and an element is returned. If you expected a different result, you should probably debug your XPath expression and/or your Java code. In any case, this issue doesn't seem to be directly related to Pdf2DOM, so it would be better discussed elsewhere.
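As a debugging aid for output like the above, the sketch below shows one way to inspect a `NodeList` rather than printing the node objects directly. In the DOM API an element's `getNodeValue()` is always null, and `item(i)` returns null for an index past the end of the list, which is consistent with a single matched `div`. The class name `NodeListInspection`, its methods, and the sample markup are hypothetical; the document is a hand-written stand-in for a DOM produced from a PDF.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListInspection {
    // Evaluates an XPath expression against a small hand-written document
    // (a hypothetical stand-in for the DOM created from a PDF).
    static NodeList query(String expression) {
        String xml = "<html><body>"
                + "<div style=\"color:#ff0000;\">A300-327-GE</div>"
                + "</body></html>";
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, dom, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }

    // Summarizes each matched node as "name | text content | style attribute".
    static List<String> describe(NodeList nodes) {
        List<String> summaries = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String style = (node instanceof Element)
                    ? ((Element) node).getAttribute("style")
                    : "";
            summaries.add(node.getNodeName() + " | " + node.getTextContent() + " | " + style);
        }
        return summaries;
    }

    public static void main(String[] args) {
        NodeList nodes = query("//*[text()='A300-327-GE']");
        // Check the length first instead of indexing blindly.
        System.out.println("matches: " + nodes.getLength());
        describe(nodes).forEach(System.out::println);
    }
}
```

Looping up to `getLength()` avoids the null from an out-of-range `item(1)`, and printing the name, text content, and attributes shows what was actually matched.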