lamuguo / re2

Automatically exported from code.google.com/p/re2
BSD 3-Clause "New" or "Revised" License

Why does RE2::Set perform worse than PartialMatch with many regexes? #126

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use RE2::Set like this; the match takes about 600+ us.
    RE2::Set s(RE2::DefaultOptions, RE2::UNANCHORED);
    s.Add("b.*abc", NULL);
    s.Add("b[^a]*abc", NULL);
    s.Add("(union.*information_schema)?dbaccz", NULL);
    s.Add("(union.*information_schema|use)*dbacc", NULL);
    s.Add(".*b*c*3*s*(where|insert|update|delete)yz", NULL);
    s.Add("(union.*information_schema|DB)?dbacc11", NULL);
    s.Add("(union.*information_schema|MYSQL)*dbacc2", NULL);
    s.Compile();
    std::vector<int> v;  // receives the indices of the patterns that matched
    // Adjacent string literals are concatenated into one input string.
    s.Match("SELECT `objid`, `objtype`, `attr`, `name`, `timestamp` FROM reg_test_55 "
            "WHERE `srcid`=894834478 AND `srctype`=1 AND `objtype`=1 AND `objid` "
            "IN(651905801,155982993,750366247,5610224,513743778) union SELECT lemma_id "
            "as lemmaId, latest_version_id, is_default, rank as rank, update_time, FROM "
            "tblLemma WHERE lemma_title = "
            "unhex('6c617374206d6f6e74682077652061736b6564206f75722073797564656e7474686569722"
            "052061637469766974696573572652061626f75742066') AND latest_version_idd > 0 AND "
            "typee = unhex('6c6173742056e')", &v);
2. Use PartialMatch like this; the loop takes about 300+ us.
    std::vector<RE2*> re_test;
    RE2 re0("b.*abc");
    RE2 re1("b[^a]*abc");
    RE2 re2("(union.*information_schema)?dbaccz");
    RE2 re3("(union.*information_schema|use)*dbacc");
    RE2 re4(".*b*c*3*s*(where|insert|update|delete)yz");
    RE2 re5("(union.*information_schema|DB)?dbacc11");
    RE2 re6("(union.*information_schema|MYSQL)*dbacc2");

    re_test.push_back(&re0);
    re_test.push_back(&re1);
    re_test.push_back(&re2);
    re_test.push_back(&re3);
    re_test.push_back(&re4);
    re_test.push_back(&re5);
    re_test.push_back(&re6);

    // Run each regex separately against the same input string.
    for (size_t j = 0; j < re_test.size(); j++) {
        RE2* re_temp = re_test[j];
        bool ret = RE2::PartialMatch(
            "SELECT `objid`, `objtype`, `attr`, `name`, `timestamp` FROM reg_test_55 "
            "WHERE `srcid`=894834478 AND `srctype`=1 AND `objtype`=1 AND `objid` "
            "IN(651905801,155982993,750366247,5610224,513743778) union SELECT lemma_id as "
            "lemmaId, latest_version_id, is_default, rank as rank, update_time, FROM "
            "tblLemma WHERE lemma_title = "
            "unhex('6c617374206d6f6e74682077652061736b6564206f75722073797564656e7474686569722"
            "052061637469766974696573572652061626f75742066') AND latest_version_idd > 0 AND "
            "typee = unhex('6c6173742056e')", *re_temp);
    }

What is the expected output? What do you see instead?
I understand that RE2::Set is designed for matching many regexes at once, so 
why does it perform worse than calling PartialMatch on each regex? Reading the 
code, I found that RE2::Set calls RunStateOnByteUnlocked many times, and that 
takes a long time. I cannot find a way to improve it. Thanks.
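
The 600+ us and 300+ us figures above depend on how the calls are timed, which the report does not say. As a minimal sketch, assuming wall-clock timing with std::chrono is acceptable, the hypothetical helper below (AverageMicros is not part of RE2) runs a callable a few times as warm-up and then averages over many iterations. The warm-up may matter here because RE2's DFA builds its states on demand, so a single first call can include construction work that later calls reuse; whether that explains the gap in this report is not confirmed in the thread.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper, not part of RE2: returns the average wall-clock time
// in microseconds of one call to fn, measured over `iterations` calls after
// `warmup` untimed calls.
template <typename Fn>
double AverageMicros(Fn fn, std::size_t warmup, std::size_t iterations) {
  // Untimed calls, so that one-time setup cost is not attributed to the average.
  for (std::size_t i = 0; i < warmup; ++i) fn();

  if (iterations == 0) return 0.0;  // nothing to measure

  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) fn();
  const auto stop = std::chrono::steady_clock::now();

  // Convert the elapsed time to fractional microseconds and average it.
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

For example, AverageMicros([&] { s.Match(text, &v); }, 10, 1000) would time the RE2::Set case, where text is a placeholder for the SQL string from step 1. Wrapping the whole PartialMatch loop from step 2 in the same way gives a comparable number.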

What version of the product are you using? On what operating system?
re2-20140304.tgz on Linux 2.6.32_1-12-0-0 #1 SMP Mon Aug 12 17:59:52 CST 2013 
x86_64 GNU/Linux

Please provide any additional information below.
NOTE: If you have a suggested patch, please see
http://code.google.com/p/re2/wiki/Contribute
for information about sending it in for review.  Thanks.

Original issue reported on code.google.com by wangt...@gmail.com on 9 Dec 2014 at 12:48

GoogleCodeExporter commented 9 years ago
RE2 has moved to GitHub. I have not moved the issues over. If this issue is 
still important to you, please file a new one at 
https://github.com/google/re2/issues. Thank you.

Original comment by rsc@golang.org on 11 Dec 2014 at 4:45