Parsing big XML files with "lxml.objectify.fromstring" returns an error

antonhagg commented 8 years ago

This is mainly related to #78, where an XML file can grow quite big (in my case it's around 500 MB and contains 779917 files and 90361 folders). But I guess this could happen otherwise too.

Anyway, lxml lets you pass a custom parser created with the option "huge_tree" (http://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files). Would this be an option, or is there another way of parsing large XML files, for example in chunks?
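For reference, here is a minimal sketch of both ideas in plain lxml: a parser built with huge_tree=True, and a streaming pass with iterparse that never holds the full tree. The helper names (parse_huge, count_elements) and the tiny sample listing are made up for illustration, and the real element names in a jottalib listing may differ. Note that huge_tree switches off libxml2's built-in size restrictions, so it is best kept for trusted input, and I have not confirmed that it resolves the error below.

```python
import io

from lxml import etree, objectify


def parse_huge(xml_bytes):
    """Parse a whole document at once with libxml2's size limits lifted."""
    # objectify.makeparser forwards its keyword arguments to etree.XMLParser
    parser = objectify.makeparser(huge_tree=True)
    return objectify.fromstring(xml_bytes, parser)


def count_elements(source, tag):
    """Stream through the document and count <tag> elements."""
    count = 0
    for _event, elem in etree.iterparse(
        source, events=("end",), tag=tag, huge_tree=True
    ):
        count += 1
        elem.clear()  # drop the element's content once it has been handled
    return count


# Hypothetical miniature listing, only to exercise the two helpers.
sample = b"<folder><file name='a'/><file name='b'/><folder name='sub'/></folder>"
root = parse_huge(sample)
n_files = count_elements(io.BytesIO(sample), "file")
```

The first helper keeps the objectify API that the rest of the code expects. The second trades that convenience for a much smaller memory footprint, so it only fits callers that can process entries one at a time.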

reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 27 Jan 2016 14:13:59 GMT
header: Accept-Ranges: bytes
header: path_list_total_files: 779917
header: path_list_total_folders: 90361
header: Content-Type: text/xml
header: Transfer-Encoding: chunked
header: Server: Jetty(8.1.4.v20120524)
DEBUG:requests.packages.urllib3.connectionpool:"GET /jfs/XX/Jotta/Sync/Backup2?mode=list HTTP/1.1" 200 None
Traceback (most recent call last):
  File "C:\Python27\Scripts\jotta-download-script.py", line 9, in <module>
    load_entry_point('jottalib==0.4.1.post1', 'console_scripts', 'jotta-download')()
  File "c:\python27\lib\site-packages\jottalib\cli.py", line 258, in download
    fileTree = remote_object.filedirlist().tree #Download the folder tree
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 304, in filedirlist
    return self.jfs.getObject(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 851, in getObject
    o = self.get(url)
  File "c:\python27\lib\site-packages\jottalib\JFS.py", line 839, in get
    o = lxml.objectify.fromstring(self.raw(url))
  File "src/lxml/lxml.objectify.pyx", line 1801, in lxml.objectify.fromstring (src\lxml\lxml.objectify.c:26755)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:123074)
  File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:117114)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:111367)
lxml.etree.XMLSyntaxError: None

antonhagg commented 8 years ago

So I have found a workaround for this: first write the XML to a file, then parse it from that file. This means the raw response and the parsed tree don't both have to be in memory at the same time. Is this an acceptable solution?

import os
import lxml.objectify

def get(self, url):
    'Make a GET request for url and return the response content as a generic lxml object'
    url = self.escapeUrl(url)
    if "?mode=list" in url:  # a full directory tree was requested, which can be very large
        if os.path.exists('temp.xml'):
            os.remove('temp.xml')
        with open("temp.xml", "w") as text_file:
            text_file.write(self.raw(url))
        try:
            # parse from disk instead of from the in-memory string
            o = lxml.objectify.parse("temp.xml").getroot()
        finally:
            # remove the temporary file even if parsing fails
            if os.path.exists('temp.xml'):
                os.remove('temp.xml')
    else:
        o = lxml.objectify.fromstring(self.raw(url))
    if o.tag == 'error':
        JFSError.raiseError(o, url)
    return o
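A possible variation on the workaround above, sketched as a standalone helper: it uses the standard tempfile module, so the file gets a unique name instead of a fixed temp.xml in the working directory. The function name parse_via_tempfile is hypothetical and not part of jottalib, and the sketch assumes the response body is available as bytes.

```python
import os
import tempfile

from lxml import objectify


def parse_via_tempfile(data):
    """Write data to a unique temporary file, parse it from disk, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        # binary mode writes the bytes exactly as they were received
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return objectify.parse(path).getroot()
    finally:
        # runs on success and on parse errors alike
        os.remove(path)


# Hypothetical error document, only to show the call.
root = parse_via_tempfile(b"<error><code>404</code></error>")
```

A unique file name means two downloads running at the same time would not overwrite each other's listing.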

havardgulldahl commented 8 years ago

Hey @antonhagg, I think you are right, we need to do something to limit our resource requirements. I'll take a look at your code, thanks!

havardgulldahl commented 8 years ago

Maybe we could try to create a StringIO object and, if we see that the file is really big, write it to disk.

Then we parse with objectify.parse(fileobject).
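One way this could look, as a rough sketch rather than the code that ended up in master: the standard library's tempfile.SpooledTemporaryFile keeps data in memory until it exceeds max_size and then moves it to a real file on disk. The helper name parse_spooled and the 1 MiB threshold are assumptions for illustration. In jottalib the chunks might come from a streamed response (for example requests' iter_content), which is also an assumption here.

```python
import tempfile

from lxml import objectify

MAX_IN_MEMORY = 1024 * 1024  # hypothetical threshold: 1 MiB


def parse_spooled(chunks, max_size=MAX_IN_MEMORY):
    """Collect byte chunks in a spooled buffer, then parse from that buffer."""
    buf = tempfile.SpooledTemporaryFile(max_size=max_size)
    try:
        for chunk in chunks:
            buf.write(chunk)  # moves to disk once max_size is exceeded
        buf.seek(0)  # rewind so the parser reads from the start
        return objectify.parse(buf).getroot()
    finally:
        buf.close()  # also deletes the on-disk file, if one was created


# Hypothetical chunked listing; the tiny max_size forces the on-disk path.
pieces = [b"<folder>", b"<file name='a'/>", b"</folder>"]
small_root = parse_spooled(pieces)
spilled_root = parse_spooled(pieces, max_size=8)
```

Small responses never touch the disk, and large ones are not held as a single string while the tree is being built.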

antonhagg commented 8 years ago

Sounds like a good idea. I won't have time to do anything until August, so if anyone else is up for the job, feel free. =)

havardgulldahl commented 8 years ago

@antonhagg I had a go at it, will you please test to see if current code in master works for you now?

antonhagg commented 8 years ago

Since "folder download" is not in the 0.5.1 release, I will have to add that first. I tried a new installation of 0.5.1, but ran into a lot of trouble... will have to sort that out first.

havardgulldahl commented 8 years ago

@antonhagg The code has not been released yet. Are you able to install from git HEAD? That is, with git clone, and not with pip?

havardgulldahl / jottalib

Parsing big XML files with "lxml.objectify.fromstring" returns an error #87