sisimai / p5-sisimai

Mail Analyzing Interface for email bounce: A Perl module that parses RFC5322 bounce mails and generates structured data as JSON from the parsed results. Formerly known as bounceHammer 4: an error mail analyzer.
https://libsisimai.org
BSD 2-Clause "Simplified" License

Perl Sisimai very slow on large emails (regexp in Sisimai/Message.pm takes 6 minutes to finish) #492

Closed gody01 closed 1 year ago

gody01 commented 1 year ago

Hello,

I came across a problem with Sisimai on Znuny/OTRS when customers send large emails with MIME type multipart/mixed and photo attachments.

The problems started after an upgrade in which Sisimai was updated to v4.25.11 (we later tried the latest Sisimai release, with the same results).

It is especially severe on CentOS 7, where parsing one email takes more than 6 minutes. On CentOS 8 and openSUSE it is a little better, but still an order of magnitude slower (16+ s) than with the previously used version (v4.24.1), which was sub-second.

After some debugging and analysis I found that the regular expression in the function makemap in Sisimai/Message.pm is responsible for practically all of the execution time:

    # Select and convert all the headers in $argv0. The following regular expression
    # is based on https://gist.github.com/xtetsuji/b080e1f5551d17242f6415aba8a00239
    my $firstpairs = { $$argv0 =~ /^([\w-]+):[ ]*(.*?)\n(?![\s\t])/gms };

The real problem seems to be that the base64-encoded image attachments are also run through this regexp as part of the body.
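One way to keep the pattern away from attachment data can be sketched as follows: cut the input at the first blank line and match only the part before it. This assumes that the headers of interest end at the first blank line and that line endings are LF. `header_pairs` is a hypothetical helper, not part of Sisimai, and this is not necessarily how the released fix works.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sisimai): collect "Name: value" pairs only
# from the text before the first blank line, so attachment data is not scanned.
sub header_pairs {
    my $text = shift;                     # reference to the raw message text
    my $end  = index($$text, "\n\n");     # the first blank line ends the headers
    my $head = $end < 0 ? $$text : substr($$text, 0, $end + 1);

    # Same pattern as in makemap, applied to the header block only
    my %pairs = $head =~ /^([\w-]+):[ ]*(.*?)\n(?![\s\t])/gms;
    return \%pairs;
}

# Two headers (one folded) followed by 1000 lines of base64-like text
my $line = ('QUJD' x 19) . "\n";
my $mail = "From: a\@example.com\nSubject: test\n mail\n\n" . ($line x 1000);
my $h    = header_pairs(\$mail);
print "$_ => $h->{$_}\n" for sort keys %$h;
```

The folded Subject header keeps its continuation line, because the pattern only stops at a newline that is not followed by whitespace.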

If I truncate $$argv0 to at most 32766 characters, execution time is sub-second again:

    my $maxlen = 32766 ;
    $$argv0 =~ s/^(.{$maxlen}).*$/$1/gms;
    $$argv0 =~ s/^[>]+[ ]//mg;  # Remove '>' indent symbol of forwarded message
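The same truncation can also be written without a regular expression. The sketch below uses `substr` as an lvalue; `truncate_text` is a hypothetical name used only for illustration. As with the workaround above, cutting the text at a fixed length may drop content the parser would otherwise use, so treat it as a stopgap.

```perl
use strict;
use warnings;

# Hypothetical helper: keep at most $maxlen characters of the referenced string
# and return the resulting length.
sub truncate_text {
    my ($text, $maxlen) = @_;
    substr($$text, $maxlen) = '' if length($$text) > $maxlen;
    return length $$text;
}

my $data = 'x' x 100_000;
print truncate_text(\$data, 32766), "\n";
```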

An example of a problematic email (15 MB) is available here for some time:

https://nc.gody.si/index.php/s/jEwXr6zEfWpeMgs

azumakuniyuki commented 1 year ago

@gody01 Thanks for the report. I've parsed qq.eml on a mid-2019 MacBook Air with Perl 5.30.0 and confirmed that the code you pointed out is terribly slow.

% perl -v | head -2

This is perl 5, version 30, subversion 0 (v5.30.0) built for darwin-2level

% perl -I./lib -MSisimai -lE 'print Sisimai->version'
4.25.15

% time perl -I./lib -MSisimai -lE 'print Sisimai->dump(shift)' ./issue-492-large-email-15m-qq.eml
[{"addresser":"vzdrzevanje@merkurnepremicnine.si","rhost":"","deliverystatus":"","timezoneoffset":"+0000","recipient":"Marko.Rzek@merkur.si","token":"49503b13e7538a1c915185506624107ee4b6249f","lhost":"","smtpcommand":"","diagnostictype":"","origin":"./issue-492-large-email-15m-qq.eml","timestamp":1681838284,"reason":"vacation","softbounce":-1,"subject":"kanalete","action":"","catch":null,"destination":"merkur.si","alias":"","listid":"","diagnosticcode":"POZOR: Sporoèilo je bilo poslano od zunanjega ponudnika. Ne odpirajte priponk in povezav, èe ne prepoznate po¹iljatelja oziroma niste preprièani, da je poslana vsebina varna. Za silo je zadeva sanirana Prilagam sliko.","replycode":"","senderdomain":"merkurnepremicnine.si","messageid":"","feedbacktype":"","smtpagent":"RFC3834"}]
perl -I./lib -MSisimai -lE 'print Sisimai->dump(shift)'   35.66s user 0.53s system 94% cpu 38.363 total
[Screenshot 2023-04-20 at 12 17 38]

I will try to fix it within approximately two weeks. Thank you again :-)

gody01 commented 1 year ago

Thank you for your work and effort.

My measurements on different systems:

openSUSE Leap 15.4:

perl -v | head -2
This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-thread-multi

perl -I. -MSisimai -lE 'print Sisimai->version'
4.25.11

time perl -I. -MSisimai -lE 'print Sisimai->dump(shift)' qq.eml
real 0m47,757s
user 0m47,582s
sys 0m0,164s

CentOS 7.x:

perl -v | head -2
This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

perl -I. -MSisimai -lE 'print Sisimai->version'
4.25.11

time perl -I. -MSisimai -lE 'print Sisimai->dump(shift)' qq.eml
real 4m39.688s
user 4m38.490s
sys 0m0.132s
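To time the header pattern in isolation, without the rest of Sisimai, a small script along these lines could be used. It is only a sketch: `synthetic_mail` and `time_header_regex` are hypothetical helpers, and the generated input is an assumption that may not reproduce the slowdown seen with the real 15 MB message.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: three headers followed by $lines lines of base64-like text
sub synthetic_mail {
    my $lines = shift;
    my $head  = "From: sender\@example.com\nTo: rcpt\@example.org\nSubject: test\n\n";
    my $line  = ('QUJDREVG' x 9) . "\n";
    return $head . ($line x $lines);
}

# Hypothetical helper: run the pattern from makemap on the referenced text and
# return the elapsed seconds and the number of header names found.
sub time_header_regex {
    my $text  = shift;
    my $start = time();
    my %pairs = $$text =~ /^([\w-]+):[ ]*(.*?)\n(?![\s\t])/gms;
    return (time() - $start, scalar keys %pairs);
}

for my $lines (1_000, 10_000) {
    my $mail = synthetic_mail($lines);
    my ($elapsed, $count) = time_header_regex(\$mail);
    printf "%6d body lines: %.3fs, %d header names\n", $lines, $elapsed, $count;
}
```

Replacing `synthetic_mail` with the contents of a real message file gives a measurement closer to the numbers reported above.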

azumakuniyuki commented 1 year ago

@gody01 I've parsed the same email file with the fixed code. The performance seems to have improved.

[Screenshot 2023-05-10 at 7 57 18]

azumakuniyuki commented 1 year ago

The fixed code has been released under the v4.25.15p3 tag.

gody01 commented 1 year ago

Excellent, thank you.

Will this patch level be published on CPAN?

azumakuniyuki commented 1 year ago

@gody01 The next release of Sisimai, v4.25.16, will include the fix and will be published on CPAN this month.

azumakuniyuki commented 1 year ago

@gody01 v4.25.16 is available on CPAN (released today): https://metacpan.org/dist/Sisimai

gody01 commented 1 year ago

Thank you. I have already found, downloaded, and tested it. I also reported it to the Znuny project, where I first noticed the anomaly.