translate / amagama

Web service for implementing a large-scale translation memory
http://amagama.translatehouse.org
GNU General Public License v3.0

Values larger than 1/3 of a buffer page cannot be indexed. #3184

Closed (unho closed this issue 5 years ago)

unho commented 10 years ago

Got the following traceback when importing some translations:

Traceback (most recent call last):
  File "/home/leo/Escritorio/repos/amagama/bin/amagama-manage", line 41, in <module>
    manager.run()
  File "/home/leo/Escritorio/repos/envs/amagama/lib/python2.6/site-packages/flask_script/**init**.py", line 423, in run
    result = self.handle(sys.argv[0], sys.argv[1:])
  File "/home/leo/Escritorio/repos/envs/amagama/lib/python2.6/site-packages/flask_script/**init**.py", line 402, in handle
    return handle(app, _positional_args, *_kwargs)
  File "/home/leo/Escritorio/repos/envs/amagama/lib/python2.6/site-packages/flask_script/commands.py", line 145, in handle
    return self.run(_args, *_kwargs)
  File "/home/leo/Escritorio/repos/amagama/amagama/commands.py", line 126, in run
    self.real_run(slang, tlang, project_style, filename)
  File "/home/leo/Escritorio/repos/amagama/amagama/commands.py", line 140, in real_run
    self.handledir(filename)
  File "/home/leo/Escritorio/repos/amagama/amagama/commands.py", line 194, in handledir
    self.handlefiles(dirname, entries)
  File "/home/leo/Escritorio/repos/amagama/amagama/commands.py", line 187, in handlefiles
    self.handlefile(pathname)
  File "/home/leo/Escritorio/repos/amagama/amagama/commands.py", line 176, in handlefile
    project_style, commit=True)
  File "/home/leo/Escritorio/repos/amagama/amagama/tmdb.py", line 344, in add_store
    commit)
  File "/home/leo/Escritorio/repos/amagama/amagama/tmdb.py", line 361, in add_list
    self.get_all_sids(units, source_lang, project_style)
  File "/home/leo/Escritorio/repos/amagama/amagama/tmdb.py", line 311, in get_all_sids
    cursor.executemany(insert_query, params)
psycopg2.OperationalError: index row size 3440 exceeds maximum 2712 for index "sources_en_text_unique_idx"
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
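For reference, the HINT describes the standard PostgreSQL workaround: instead of putting the raw text column into the unique B-tree index, index an MD5 digest of it and compare digests on lookup, since md5() output is always 32 characters and can never hit the 1/3-of-a-page limit. A rough psycopg2 sketch, not what amagama ended up doing: the index name and the text/sid column names are taken from the error message and the code below, the connection string is made up, and the real index apparently covers additional columns that are omitted here.

import psycopg2

conn = psycopg2.connect("dbname=amagama")  # connection details are an assumption
cur = conn.cursor()

# Swap the oversized unique index on the raw text for one on its MD5 digest.
# (The real sources_en_text_unique_idx reportedly includes further columns;
# they are left out of this sketch.)
cur.execute("DROP INDEX IF EXISTS sources_en_text_unique_idx")
cur.execute("CREATE UNIQUE INDEX sources_en_text_unique_idx "
            "ON sources_en (md5(text))")

# Duplicate checks must then compare digests so the planner can use the index:
source = u"some very long source string..."
cur.execute("SELECT sid FROM sources_en WHERE md5(text) = md5(%s)", (source,))
print(cur.fetchone())

conn.commit()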
vince62s commented 6 years ago

Is this going to be fixed one day, or has the project been left by the wayside? Cheers.

friedelwolff commented 6 years ago

I've applied this diff to tmdb.py locally as a workaround:

@@ -240,6 +284,9 @@ CREATE INDEX targets_%(slang)s_sid_lang_idx ON targets_%(slang)s (sid, lang);
             %%(sid)s, %%(target)s, %%(target_lang)s)""" % slang
             cursor.execute(query, unit)

+    def usable_units(self, units):
+        return filter(lambda u: max(len(u['source']), len(u['target'])) < 2712, units)
+
     def get_all_sids(self, units, source_lang, project_style):
         """Ensures that all source strings are in the database+cache."""
         all_sources = set(u['source'] for u in units)
@@ -348,6 +395,7 @@ CREATE INDEX targets_%(slang)s_sid_lang_idx ON targets_%(slang)s (sid, lang);
             # store them
             return 0

+        units = self.usable_units(units)
         self.get_all_sids(units, source_lang, project_style)

         try:

Postgres 10 fixed hash indexes, so perhaps that would be a better solution, if it provides all the functionality this index needs. The documentation says that only B-tree indexes can be used for unique indexes, though, so I might be wrong.
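For context, PostgreSQL 10 made hash indexes WAL-logged and safe for production, and because they store only a hash code per row they are immune to the row-size limit; the catch is exactly the one noted above, that they cannot enforce uniqueness. A hedged sketch, reusing the assumed table and column names from the earlier example:

import psycopg2

conn = psycopg2.connect("dbname=amagama")  # connection details are an assumption
cur = conn.cursor()

# A hash index stores only a hash code plus a row pointer, so the
# "1/3 of a buffer page" limit never applies to the indexed text,
# and since PostgreSQL 10 hash indexes are crash-safe.
cur.execute("CREATE INDEX sources_en_text_hash_idx "
            "ON sources_en USING hash (text)")

# Equality lookups can use it directly, with no md5() wrapping needed...
cur.execute("SELECT sid FROM sources_en WHERE text = %s",
            (u"some source string",))

# ...but CREATE UNIQUE INDEX ... USING hash is not supported, so the
# uniqueness guarantee would still need a B-tree, e.g. on md5(text).
conn.commit()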

Amagama is still in production, so this might still get fixed.

friedelwolff commented 5 years ago

I've just looked into this some more. It seems to be a bit more subtle. Postgres compresses text values, so it can handle longer values in the unique index, as long as they compress down small enough (the index also contains other columns, so I don't know exactly how small the value needs to be). So while my patch is correct and avoids the error, it filters out some values that could otherwise have been handled successfully.
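In other words, the 2712-byte limit applies to the value after PostgreSQL's inline (pglz) compression, so a plain character-length cutoff is conservative. A client-side check can only approximate that; the sketch below uses zlib as a stand-in for pglz (which compresses differently) and a made-up safety margin for the other indexed columns, so it is an estimate rather than a guarantee, not the code that went into amagama.

import zlib

INDEX_ROW_LIMIT = 2712   # from the error message
SAFETY_MARGIN = 512      # hypothetical allowance for the other indexed columns

def probably_indexable(text):
    """Rough guess at whether a value will fit in the unique index.

    PostgreSQL compresses large values with pglz before indexing;
    zlib is used here only as an approximation of that compression.
    """
    raw = text.encode('utf-8')
    if len(raw) <= INDEX_ROW_LIMIT - SAFETY_MARGIN:
        return True
    return len(zlib.compress(raw)) <= INDEX_ROW_LIMIT - SAFETY_MARGIN

def usable_units(units):
    # Same role as the workaround above, but based on estimated on-disk
    # size rather than raw character count.
    return [u for u in units
            if probably_indexable(u['source']) and probably_indexable(u['target'])]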

friedelwolff commented 5 years ago

I believe I've fixed this reasonably well now. I tested the behaviour of the compression a bit, and I think the current code will attempt to import long strings up to the point where they should still work.

Note that I also updated the code to respect MAX_LENGTH during import. The old value of 1000 will often have a bigger effect than the code fixing this problem: anything beyond 1000 characters is really long and not quite in the domain of traditional translation memory. I increased the limit to 2000 in a follow-up commit anyway, just in case.
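For illustration, a length cap like that applied during import could look roughly as follows; MAX_LENGTH is the setting mentioned above, but the helper and the exact place it is applied are assumptions rather than the actual commit.

MAX_LENGTH = 2000  # raised from 1000, per the comment above

def within_max_length(units, max_length=MAX_LENGTH):
    """Drop units whose source or target exceeds the configured cap.

    Hypothetical helper: the real code respects MAX_LENGTH during import,
    but not necessarily in this exact form.
    """
    kept = [u for u in units
            if len(u['source']) <= max_length and len(u['target']) <= max_length]
    skipped = len(units) - len(kept)
    if skipped:
        print("Skipped %d unit(s) longer than %d characters" % (skipped, max_length))
    return kept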