grosjo / fts-xapian

Dovecot FTS plugin based on Xapian
GNU Lesser General Public License v2.1

commit when last commit was too long ago #44

Closed symphorien closed 4 years ago

symphorien commented 4 years ago

When indexing mail with unusually long headers, indexing becomes slow and memory-hungry. Indexing such emails in batches of 1000 ends up consuming all available memory, and the indexing process is killed. The next time indexing is triggered, all the work is lost, so no progress can be made.

This patch enforces a commit every 5 minutes. This ensures that progress is made even if only 36 emails (a real number) are indexed in that time and the process is then killed because of high memory usage.

About the choice of 5 minutes: commit time will be negligible when this triggers. In my workload, I saw both cases: sometimes the number of updates was the limiting factor that triggered the commit, and sometimes the elapsed time was.
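A minimal sketch of the two-threshold idea described above: commit when either too many updates are pending or too much time has passed since the last commit. This is not the plugin's actual code. The class `CommitPolicy`, its methods, and the constants `kCommitLimit` and `kCommitTimeout` are hypothetical names chosen for illustration; only the values 1000 and 5 minutes come from this thread.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical constants: the values mirror the thread (1000 updates,
// 5 minutes), the names are made up for this sketch.
constexpr std::size_t kCommitLimit = 1000;
constexpr std::chrono::milliseconds kCommitTimeout{300000};

// Tracks pending updates and the time of the last commit.
class CommitPolicy {
public:
    using Clock = std::chrono::steady_clock;

    explicit CommitPolicy(Clock::time_point now) : last_commit_(now) {}

    // Call once per indexed update.
    void record_update() { ++pending_; }

    // True when a commit is due: there is pending work AND either the
    // count limit is exceeded or the time limit has been reached.
    bool should_commit(Clock::time_point now) const {
        if (pending_ == 0) return false;
        return pending_ > kCommitLimit ||
               (now - last_commit_) >= kCommitTimeout;
    }

    // Call right after the database commit succeeds.
    void mark_committed(Clock::time_point now) {
        pending_ = 0;
        last_commit_ = now;
    }

    std::size_t pending() const { return pending_; }

private:
    std::size_t pending_ = 0;
    Clock::time_point last_commit_;
};
```

The caller would check `should_commit(Clock::now())` after each update and, when it returns true, commit the database and call `mark_committed`. Passing the time in explicitly, rather than reading the clock inside the class, keeps the policy easy to test.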

grosjo commented 4 years ago

Actually, that is the purpose of

#define XAPIAN_COMMIT_LIMIT 1000

(commit in all cases after 1000 header lines)

Can you put a trace in your setup to figure out the best option: a timeout or a number of header lines? (The latter can be much higher than the number of emails, as you wrote above, in the case of long headers.)

symphorien commented 4 years ago

I have seen commits triggered by the 5-minute limit after as few as 2 updates:

8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 531456 ms and 36 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 543433 ms and 30 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 321247 ms and 58 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 38535 ms and 1001 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 20797 ms and 1001 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 37363 ms and 1001 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 300563 ms and 847 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 301410 ms and 2 updates...   
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 301237 ms and 24 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 538577 ms and 17 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 582925 ms and 33 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 578313 ms and 38 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 300092 ms and 35 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 305128 ms and 30 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 300229 ms and 36 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 302074 ms and 28 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 311570 ms and 51 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 310660 ms and 72 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 573554 ms and 36 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 555565 ms and 41 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 300203 ms and 58 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 544652 ms and 71 updates...  
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 76778 ms and 1001 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 24646 ms and 1001 updates... 
8652><MJb8HuCBqF7MIQAA0eh/aQ>: FTS Xapian: Refreshing after 276406 ms and 1001 updates...

So I don't think that a limit on the number of updates alone is good enough (or we would have to set the limit to 2, which is terrible in terms of performance).

The number limit is necessary to keep the memory used by easy emails in check. The time limit is necessary to overcome larger ones.

grosjo commented 4 years ago

OK. To understand a bit better, can you paste the content (hiding personal data) of an example email that took so long to index? For instance, I see 300 seconds for 2 updates, which is terrible.

symphorien commented 4 years ago

I only investigated the first blocking email. Indexing slowed down on the CC: header of the email, which contained more than 500 email addresses. My understanding, from a superficial look at the code, is that a long header makes the number of n-grams explode by simple combinatorics.
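To illustrate the combinatorics, here is a rough sketch that counts n-grams under one simple assumption: every contiguous substring of a token whose length lies between a minimum and a maximum becomes an index term. How the plugin actually tokenizes headers is not stated in this thread, so the function `count_ngrams` and its parameters are hypothetical, and `min_len` is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstddef>

// Counts the contiguous substrings of a token of length n whose length
// lies in [min_len, max_len]. Assumes min_len >= 1.
// Hypothetical helper, not part of fts-xapian.
std::size_t count_ngrams(std::size_t n, std::size_t min_len,
                         std::size_t max_len) {
    std::size_t total = 0;
    const std::size_t longest = std::min(n, max_len);
    for (std::size_t k = min_len; k <= longest; ++k) {
        total += n - k + 1;  // number of substrings of length k
    }
    return total;
}
```

Under this assumption, with no effective upper bound on the n-gram length, a 20-character token yields 171 terms while a 10,000-character token yields roughly 50 million, so the count grows quadratically with the token length.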

Unfortunately, I can't anonymize a CC header, as by definition, it only contains identifying information.


Note that although it would be nice to fix the slow indexing in corner cases like this, that is not the point of this PR. The goal of this PR is to ensure that indexing does not get stuck on such corner cases, which we will never be able to eliminate completely.

grosjo commented 4 years ago

Yes. We still need to address the slowness on huge headers.

grosjo commented 4 years ago

Can you run the latest git version and confirm that it runs properly?

symphorien commented 4 years ago

I reindexed everything and it works!