okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License

XML parsing error while running Rosie #19

Closed jtemporal closed 7 years ago

jtemporal commented 7 years ago

@cuducos I was running Rosie and got the error below. Could you review it, please?

(serenata_rosie) 19:03:52 at rosie (master)$ python rosie.py run
2017-01-11 19:04:23 Creating the CSV file
2017-01-11 19:04:24 Reading the XML file
2017-01-11 19:04:24 Writing record #346 to the CSV
2017-01-11 19:04:24 Done!
2017-01-11 19:04:24 Creating the CSV file
2017-01-11 19:04:24 Reading the XML file
2017-01-11 19:06:19 Writing record #337,740 to the CSV
2017-01-11 19:06:19 Done!
2017-01-11 19:06:19 Creating the CSV file
2017-01-11 19:06:19 Reading the XML file
Traceback (most recent call last): #114,024 to the CSV
  File "rosie.py", line 36, in <module>
    command()
  File "rosie.py", line 23, in run
    rosie.main(target_directory)
  File "/home/temporal/Documents/Serenata/rosie/rosie/__init__.py", line 64, in main
    dataset = Dataset(target_directory).get()
  File "/home/temporal/Documents/Serenata/rosie/rosie/dataset.py", line 16, in get
    self.update_datasets()
  File "/home/temporal/Documents/Serenata/rosie/rosie/dataset.py", line 28, in update_datasets
    ceap.convert_to_csv()
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/ceap_dataset.py", line 36, in convert_to_csv
    convert_xml_to_csv(xml_path, csv_path)
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/xml2csv.py", line 70, in convert_xml_to_csv
    for json_io in xml_parser(xml_file_path):
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/xml2csv.py", line 23, in xml_parser
    for event, element in iterparse(xml_path, tag=tag):
  File "src/lxml/iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:148582)
  File "src/lxml/iterparse.pxi", line 193, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:148280)
  File "src/lxml/iterparse.pxi", line 224, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:148818)
  File "src/lxml/parser.pxi", line 1374, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:114116)
  File "src/lxml/parser.pxi", line 586, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:104990)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "/tmp/serenata-data/AnosAnteriores.xml", line 2
lxml.etree.XMLSyntaxError: Couldn't find end of Start Tag numEspecificacao, line 2, column 1

cuducos commented 7 years ago

Looks like an issue with the syntax of the XML from the Chamber of Deputies… Have you seen something similar, @irio?

cuducos commented 7 years ago

Probably a syntax issue with one of the XML files:

$ xmllint data/AnosAnteriores.xml
data/AnosAnteriores.xml:1: parser error : StartTag: invalid element name
</numParcela><txtPassageiro/><txtTrecho/><numLote>0</numLote><numRessarcimento>0
                                                                               ^
data/AnosAnteriores.xml:2: parser error : Premature end of data in tag numRessarcimento line 1

^
data/AnosAnteriores.xml:2: parser error : Premature end of data in tag DESPESA line 1

^
data/AnosAnteriores.xml:2: parser error : Premature end of data in tag DESPESAS line 1

^
data/AnosAnteriores.xml:2: parser error : Premature end of data in tag orgao line 1

^

xmllint (available on macOS and Linux, AFAIK) can repair the file for us, though we may lose a record or two:

$ mv data/AnosAnteriores.xml data/AnosAnteriores.xml.bkp
$ xmllint data/AnosAnteriores.xml.bkp  --recover --output data/AnosAnteriores.xml

However, I'm not sure this is a proper solution, i.e. calling xmllint from Python when parsing fails. Any ideas?
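
To make the idea concrete, here is a rough sketch of what "call xmllint from Python in case of error" could look like. It is not what the toolbox does. The function names (`is_well_formed`, `repair_with_xmllint`, `ensure_parseable`) are made up for illustration, and the check uses the standard library's `xml.etree.ElementTree` instead of lxml so the snippet has no extra dependencies. The repair step mirrors the two shell commands above.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def is_well_formed(xml_path):
    """Return True if the whole file parses without a syntax error."""
    try:
        for _event, element in ET.iterparse(xml_path):
            element.clear()  # drop parsed content to limit memory use
    except ET.ParseError:
        return False
    return True


def repair_with_xmllint(xml_path):
    """Hypothetical helper: back up the file, then write a recovered copy."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    backup_path = xml_path + ".bkp"
    shutil.move(xml_path, backup_path)
    # Assumption: xmllint may exit non-zero even when recovery produced
    # output, so the return code is deliberately not checked here.
    subprocess.run(["xmllint", backup_path, "--recover", "--output", xml_path])
    return backup_path


def ensure_parseable(xml_path, repair=repair_with_xmllint):
    """Repair the file only if it fails to parse; return True if repaired."""
    if is_well_formed(xml_path):
        return False
    repair(xml_path)
    return True
```

Passing the repair step in as an argument keeps the decision ("is the file broken?") separate from the fix, so a different fix could be swapped in later without touching the check. Records inside the truncated part of the file would still be lost either way.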

cuducos commented 7 years ago

More clues that the file is actually broken: the end of the file is in the middle of a tag… https://nbviewer.jupyter.org/gist/cuducos/50df395ac13fbf7159282cb6f6c109c3
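
For anyone who wants to check this without opening the notebook, a quick sketch that looks at the last bytes of the file. It assumes the outermost element is `orgao` (going by the xmllint output above), and `tail` and `ends_with_closing_tag` are just illustrative names.

```python
import os


def tail(path, size=200):
    """Return the last `size` bytes of the file without reading all of it."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        length = f.tell()
        f.seek(max(0, length - size))
        return f.read()


def ends_with_closing_tag(path, root_tag):
    """True if the file ends with </root_tag>, ignoring trailing whitespace."""
    closing = "</{}>".format(root_tag).encode("ascii")
    return tail(path).rstrip().endswith(closing)
```

Something like `ends_with_closing_tag("data/AnosAnteriores.xml", "orgao")` returning `False` would point to a truncated download rather than a bug in the parser.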

jtemporal commented 7 years ago

I'm closing this since #22 fixed the parsing error =)