stchris / untangle

Converts XML to Python objects
MIT License
612 stars 83 forks

ignoring "Junk after document element" error #51

Closed rmsandu closed 6 years ago

rmsandu commented 7 years ago

The error appears when there is extra text (junk) outside the root element, i.e. after the closing </INFO> tag. Something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<INFO>
  <STUDY>11111
    <INSTITUTION>some name</INSTITUTION>
    <STUDY_DATE>189888</STUDY_DATE>
    <COMMENT>some comment</COMMENT>
  </STUDY>
</INFO>
ENED_IMAGE>
 <SCREENSHOTS/>

Is there any way to ignore the junk, i.e. the text after </INFO>, and still access the elements under the root, such as <INSTITUTION>? I have hundreds of XML files, and the junk text differs every time.

stchris commented 6 years ago

I'm afraid nothing immediately comes to mind. I can recommend BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/) for parsing XML that's not well-formed. And sorry for taking so long to respond.
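
One possible workaround, for what it's worth, is to cut the text off right after the root element's closing tag before handing it to a parser. The sketch below is only an illustration: `strip_trailing_junk` is a hypothetical helper, not part of untangle, and it assumes the root element's name (here `INFO`) is not nested inside itself. It uses the standard library's `xml.etree.ElementTree` to check the result.

```python
import re
import xml.etree.ElementTree as ET


def strip_trailing_junk(xml_text: str) -> str:
    """Return xml_text truncated right after the root element's closing tag.

    Hypothetical helper (not part of untangle). Assumes the root element's
    name does not also appear as a nested element.
    """
    # The first start tag is taken as the root; "<?xml ...?>", comments and
    # "<!DOCTYPE" are skipped because they do not start with a name character.
    start = re.search(r"<([A-Za-z_][\w:.\-]*)", xml_text)
    if start is None:
        raise ValueError("no root element found")
    root_name = start.group(1)

    # Look for the first matching closing tag after the root start tag.
    closing = re.compile(r"</" + re.escape(root_name) + r"\s*>")
    end = closing.search(xml_text, start.end())
    if end is None:
        raise ValueError(f"no closing tag for <{root_name}> found")

    # Keep everything up to and including the closing tag, drop the rest.
    return xml_text[: end.end()]


SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1"?>
<INFO>
  <STUDY>11111
    <INSTITUTION>some name</INSTITUTION>
    <STUDY_DATE>189888</STUDY_DATE>
    <COMMENT>some comment</COMMENT>
  </STUDY>
</INFO>
ENED_IMAGE>
 <SCREENSHOTS/>
"""

cleaned = strip_trailing_junk(SAMPLE)
root = ET.fromstring(cleaned)
print(root.find("STUDY/INSTITUTION").text)  # some name
```

Since `untangle.parse()` takes an XML string as well as a filename or URL, the cleaned string could presumably be passed to it in the same way. For files on disk, the text would need to be read first, e.g. with `open(path, encoding="iso-8859-1")` to match the declared encoding.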