translate / amagama

Web service for implementing a large-scale translation memory
http://amagama.translatehouse.org
GNU General Public License v3.0

Importing of large TMs (e.g., 10GB XLIFF files) fails #3217

Open pmarcis opened 5 years ago

pmarcis commented 5 years ago

Hi!

I deployed amagama on an Ubuntu machine with 64GB of RAM and a 2TB SSD (almost empty), and tried importing a 10GB XLIFF file (28.5 million segments). It did not work. I got the following errors (I can't tell much from them, though):

MemoryError

During handling of the above exception, another exception occurred:

MemoryError

[the MemoryError / "During handling of the above exception, another exception occurred:" pair repeats many more times; the only traceback fragments in the output were the following]

Traceback (most recent call last):
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/xliff.py", line 880, in parsestring
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 781, in parsestring
MemoryError

During handling of the above exception, another exception occurred:

[more repetitions of the same pair]

Traceback (most recent call last):
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/flask_script/__init__.py", line 417, in run
MemoryError

During handling of the above exception, another exception occurred:

MemoryError

Then I tried splitting the large TM into smaller chunks of 300,000 segments. That (almost) worked for 95 of the 96 parts. The remaining part I had to split even further, down to chunks of up to 25,000 segments. The following error kept appearing (different from the one for the large TM file):

Importing /home/marcis/general.tm.51g.xlf
ERROR:root:Error while processing: /home/marcis/general.tm.51g.xlf
Traceback (most recent call last):
  File "/home/marcis/amagama/amagama/commands.py", line 161, in handlefile
    store = factory.getobject(filename)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/factory.py", line 209, in getobject
    store = storeclass.parsefile(storefile)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 900, in parsefile
    newstore = cls.parsestring(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/xliff.py", line 880, in parsestring
    xliff = super(xlifffile, cls).parsestring(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/base.py", line 781, in parsestring
    newstore.parse(storestring)
  File "/home/marcis/anaconda3/lib/python3.7/site-packages/translate/storage/lisa.py", line 335, in parse
    self.document = etree.fromstring(xml, parser).getroottree()
  File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 94704

Once I had split everything down to the small parts, the small XLIFF files (almost) imported without errors.

Why almost?

For every file, amagama printed in the console:

Succesfully imported [FILE NAME]

But... when I looked in the PostgreSQL database, there were exactly 0 entries.

Any suggestions on what might have failed?

pmarcis commented 5 years ago

An update ...

I found out that the import function does not support my XLIFF files. As a workaround, the following works:

Once I had done this, all files could be imported successfully!

I also had to switch from Python 3.7 to 2.7, as amagama did not work with Python 3.7.

friedelwolff commented 5 years ago

I haven't worked on this in a while, but my (unconfirmed) suspicion is that the problem might be in the Translate Toolkit and not in amaGama. Can you check whether pocount from the Translate Toolkit works on these files? There was a change some time ago in the lxml library regarding the handling of large XML files, and this might be what is happening here, but I'm really just guessing.

friedelwolff commented 5 years ago

I also think I know why nothing was imported when using XLIFF: there is probably a mismatch between your view of the state of the translations and amaGama's (really the Translate Toolkit's). That is why the conversion to PO marks them as fuzzy. If you can paste a snippet of the XLIFF file, I should be able to confirm. I'm guessing you don't have approved="yes".
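
In the meantime, something along these lines (untested; the file name is just a placeholder) should show how the Toolkit classifies your units:

from translate.storage import factory

# Untested sketch: inspect how the Translate Toolkit sees the units in an
# XLIFF file. "corpus.xlf" is just a placeholder name.
store = factory.getobject("corpus.xlf")
for unit in store.units[:5]:
    if unit.isheader():
        continue
    print(unit.source)
    print(unit.target)
    # Without approved="yes" I would expect the unit to be counted as fuzzy
    # rather than translated.
    print("approved: %s, fuzzy: %s, translated: %s"
          % (unit.isapproved(), unit.isfuzzy(), unit.istranslated()))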

pmarcis commented 5 years ago

I ran the pocount tool on one of the XLIFF files (1.3GB). I got the following result:

pocount corpus.xlf
Processing file : corpus.xlf
Type               Strings      Words (source)    Words (translation)
Translated:       0 (  0%)          0 (  0%)               0
Fuzzy:        2956993 (100%)   59813203 (100%)             n/a
Untranslated:     0 (  0%)          0 (  0%)             n/a
Total:        2956993          59813203                      0

Needs-Work:   2956993 (100%)   59813203 (100%)               0

My XLIFF files were generated from parallel corpora. An example is as follows (none of the segments have the approved attribute):

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file original="abc.txt" source-language="lt" target-language="en" datatype="plaintext">
    <header/>
    <body>
      <trans-unit id="20275" xml:space="preserve">
        <source>- JAV doleris puslapis .</source>
        <target>- US Dollar Index .</target>
      </trans-unit>
    </body>
  </file>
</xliff>

However, now that I know that the approved attribute is required, I can add it in my conversion tool that converts the parallel corpus into XLIFF.
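
For the chunk files that already exist, a small post-processing step should also do the trick; a rough, untested sketch (file names are placeholders, and it assumes no XML namespace, as in the example above):

from lxml import etree

# Untested sketch: set approved="yes" on every trans-unit of an existing
# (smaller) XLIFF file instead of regenerating it from the corpus.
tree = etree.parse("corpus.part.xlf")
for unit in tree.iter("trans-unit"):
    unit.set("approved", "yes")
tree.write("corpus.part.approved.xlf", xml_declaration=True, encoding="UTF-8")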

friedelwolff commented 5 years ago

I can't think of a reason why pocount would process it successfully but not amaGama. You might want to go with the smaller files (or the PO conversion) for now. One advantage of using several files is that you can run import commands in parallel. You have to invoke the multiple import commands manually, but I tried to make that a safe way to speed up the import.

If you alter your XLIFF files to have approved="yes", pocount should also report the number of target words.

By the way, I'm probably starting to work on Python 3 support soon.

friedelwolff commented 5 years ago

Oh, I misread what you wrote: pocount works on the smaller file. OK, then things are consistent.

Although we have from the outset worked with and planned for gigabyte-sized databases, I don't think I've tried importing such large files. I can't think of a reason it shouldn't simply work, but the Translate Toolkit holds a complete file in memory while processing it, which is probably part of the problem here.
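
If you end up splitting again, lxml's iterparse can chunk a file without holding all of it in memory. A rough, untested sketch; the chunk size, file names and the XLIFF skeleton are just examples based on your snippet (which has no XML namespace):

from lxml import etree

# Untested sketch: split a huge XLIFF 1.0 file (no XML namespace, as in the
# example above) into smaller files of CHUNK trans-units each, without
# loading the whole document into memory.
CHUNK = 300000

# Adjust the <file> attributes to match the real corpus.
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<xliff version="1.0">\n'
          '  <file original="abc.txt" source-language="lt"'
          ' target-language="en" datatype="plaintext">\n'
          '    <header/>\n'
          '    <body>\n')
FOOTER = '    </body>\n  </file>\n</xliff>\n'

def split_xliff(path, prefix):
    out = None
    count = 0
    chunk_no = 0
    # huge_tree=True lifts the parser limits mentioned below.
    for _, elem in etree.iterparse(path, tag='trans-unit', huge_tree=True):
        if out is None:
            chunk_no += 1
            out = open('%s.%03d.xlf' % (prefix, chunk_no), 'wb')
            out.write(HEADER.encode('utf-8'))
        out.write(etree.tostring(elem))
        count += 1
        # Free the memory of everything already written out.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        if count >= CHUNK:
            out.write(FOOTER.encode('utf-8'))
            out.close()
            out = None
            count = 0
    if out is not None:
        out.write(FOOTER.encode('utf-8'))
        out.close()

# Placeholder file names.
split_xliff('general.tm.xlf', 'general.tm.part')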

The issue with lxml parsing big files started with lxml 2.7 as a security precaution. You can add the parameter huge_tree=True next to resolve_entities=False close to the bottom of translate/storage/lisa.py if you are interested in diving into the code (untested). It might help, but maybe the smaller files work well enough for your case?
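
For reference, the kind of change I mean is roughly the following (untested; the exact surrounding code in lisa.py may differ):

from lxml import etree

# Untested sketch of the suggested tweak in translate/storage/lisa.py:
# pass huge_tree=True to the parser that is already created with
# resolve_entities=False, so libxml2's size limits no longer apply.
parser = etree.XMLParser(resolve_entities=False, huge_tree=True)
# lisa.py then keeps using this parser, e.g.:
# self.document = etree.fromstring(xml, parser).getroottree()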