Open jwaldmann opened 2 years ago
This is the top of the profile
Thu Aug 4 15:18 2022 Time and Allocation Profiling Report (Final)
ceta-postproc +RTS -P -H -RTS /dev/stdin data/bound/3633.xml
total time = 2474.80 secs (2474795 ticks @ 1000 us, 1 processor)
total alloc = 1,770,234,125,896 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc ticks bytes
integer_of_ci Ceta CeTA-2.42/generated/Haskell/Ceta.hs:691:1-36 14.8 0.0 366746 432
integer_of_char Ceta CeTA-2.42/generated/Haskell/Ceta.hs:694:1-48 12.2 0.0 302392 0
comparator_list Ceta CeTA-2.42/generated/Haskell/Ceta.hs:(4238,1)-(4246,34) 10.5 0.0 260644 48
balance Ceta CeTA-2.42/generated/Haskell/Ceta.hs:(2241,1)-(2319,26) 9.3 64.4 230796 1139164301088
comparator_of Ceta CeTA-2.42/generated/Haskell/Ceta.hs:964:1-75 7.4 0.0 182817 32
less_char Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5881:1-53 7.3 0.0 180981 0
rbt_ins Ceta CeTA-2.42/generated/Haskell/Ceta.hs:(10125,1)-(10135,46) 7.1 23.6 175869 418499570800
equal_char Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5871:1-55 6.7 0.0 164627 176
compare Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5907:3-24 5.4 0.0 134755 16
less Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5885:3-18 4.6 0.0 114569 0
rm_iterateoi Ceta CeTA-2.42/generated/Haskell/Ceta.hs:(12705,1)-(12712,15) 3.8 8.6 93627 151723221568
compare_char Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5904:1-28 3.4 0.0 83480 0
compare_list Ceta CeTA-2.42/generated/Haskell/Ceta.hs:4249:1-38 2.8 0.0 70102 64
== Ceta CeTA-2.42/generated/Haskell/Ceta.hs:5874:3-25 2.2 0.0 54978 0
paint Ceta CeTA-2.42/generated/Haskell/Ceta.hs:(2339,1)-(2340,46) 0.5 2.9 12080 50592874656
A function balance
does a lot of allocation. That's for building red-black trees, used to represent sets.
Then there's comparator_list
that takes a lot of time.
Could it be that the elements of the set (the keys of the tree) are strings? Well, then it certainly gets expensive.
Two quick thoughts:
most data is very short-lived. the collector says
1,831,682,981,848 bytes allocated in the heap
394,478,400 bytes copied during GC
7,305,240 bytes maximum residency (56 sample(s))
note: 2 TB allocated, max residency < 10 MB.
with some RTS options, I can get runtime down (from 11 s to 9 s) and GC time is 0.4 percent only
+RTS -M60g -A30g -G1
Productivity 99.6% of total user, 99.6% of total elapsed
I made a profile with hpc
, the top is
hpc show ceta-postproc.tix |grep Top|sort -nr -k2
23195 46792471618 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 694:1-694:48 TopLevelBox ["integer_of_char"]
23192 46792471618 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 691:1-691:36 TopLevelBox ["integer_of_ci"]
23189 46792471618 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 651:1-651:41 TopLevelBox ["rep_char_integer"]
9846 23962369769 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 4238:1-4246:34 TopLevelBox ["comparator_list"]
18667 23660888459 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 964:1-964:75 TopLevelBox ["comparator_of"]
23598 23659132122 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 5881:1-5881:53 TopLevelBox ["less_char"]
23464 23133102693 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 5871:1-5871:55 TopLevelBox ["equal_char"]
101730 23132574996 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 5874:3-5874:25 TopLevelBox ["=="]
12876 12901381636 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10125:1-10135:46 TopLevelBox ["rbt_ins"]
9849 12016035216 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 4249:1-4249:38 TopLevelBox ["compare_list"]
101578 12016035216 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 4252:3-4252:24 TopLevelBox ["compare"]
12172 8016688211 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 2241:1-2319:26 TopLevelBox ["balance"]
3801 2115950552 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 12705:1-12712:15 TopLevelBox ["rm_iterateoi"]
6066 1124900877 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10117:1-10117:19 TopLevelBox ["impl_of"]
11573 1054018222 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 2339:1-2340:46 TopLevelBox ["paint"]
12895 1054018210 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10146:1-10146:53 TopLevelBox ["insert"]
12888 1054018210 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10143:1-10143:49 TopLevelBox ["rbt_insert"]
12884 1054018210 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10140:1-10140:55 TopLevelBox ["rbt_insert_with_key"]
13080 1053972026 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 20127:1-20127:36 TopLevelBox ["ins_rm_basic_ops"]
6064 226547896 ceta-postproc-2.42.0-8CNp1Qy7y5f8ZoGXdmhNI2/Ceta 10149:1-10154:28 TopLevelBox ["rbt_lookup"]
I find the high number of integer_of_char
suspicious. Certificate size is 242 kB, and each char should be processed only once?
But perhaps it's harmless, since
690 integer_of_ci :: Char_integer -> Integer;
691 integer_of_ci x = rep_char_integer x;
692
693 integer_of_char :: Char -> Integer;
694 integer_of_char (Char_of_ci x) = integer_of_ci x;
648 newtype Char_integer = Abs_char_integer Integer;
649
650 rep_char_integer :: Char_integer -> Integer;
651 rep_char_integer (Abs_char_integer x) = x;
Hm. This should be inlined away? Perhaps it is, and it's made visible only because of instrumentation for profiling. (Could be improved by -fprof-late
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7797 coming with ghc-9.4)
Still, the above question remains (red-black-tree with lists as keys)
program compiled with
cabal build --with-ghc=/opt/ghc/ghc-9.4.0.20220721/bin/ghc --enable-profiling -O2 --ghc-options=-fprof-late -v
gives this profile:
ceta-postproc +RTS -p -M60g -A30g -G1 -RTS /dev/stdin data/bound/3633.xml
total time = 800.28 secs (800284 ticks @ 1000 us, 1 processor)
total alloc = 1,735,038,859,928 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
$fCcompare(,)_$scomparator_list Ceta CeTA-2.42/generated/Haskell/Ceta.hs:4238:1-15 29.3 0.0
balance Ceta CeTA-2.42/generated/Haskell/Ceta.hs:2241:1-7 29.3 65.7
$s$srbt_ins12 Ceta CeTA-2.42/generated/Haskell/Ceta.hs:10125:1-7 28.7 22.3
rm_iterateoi Ceta CeTA-2.42/generated/Haskell/Ceta.hs:12705:1-12 6.6 7.3
lvl3943 Ceta <no location info> 5.1 4.3
conclusion: integer_of_char
is inlined away, red-black balancing and list-compare is prominent.
without changing the code: have GHC aggresively transform the code with
cabal install --with-ghc=/opt/ghc/ghc-9.4.0.20220721/bin/ghc -O2 --ghc-options="-fspecialize-aggressively -fexpose-all-unfoldings -O2"
then I am getting
lzip -dc data/bound/3633.cpf.ttt2.lz| time /home/waldmann/.cabal/bin/ceta-postproc /dev/stdin data/bound/3633.xml +RTS -M60g -A30g -G1
starexec-result="YES"
certification-result="CERTIFIED"
certification-time="442.7"
output-size="242320"
With plain O2 for the compiler, and the same RTS options, the executable built with ghc-9.2.4
gets
certification-time="523.6"
With same (aggressive) compiler and RTS options, executable built with ghc-8.10.7
gets
certification-time="431.1"
debugging with ghci, set breakpoint at function check_ta_bounded
, load a certificate, then :force ta
, will print
[CeTA-2.42/generated/Haskell/Ceta.hs:(36670,3)-(36686,13)] ghci> :force ta
ta = Ceta.TA_Impl
(Ceta.RBT (Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.R
(Ceta.Branch
Ceta.B (Ceta.Branch Ceta.R Ceta.Empty [49] () Ceta.Empty) [50] ()
Ceta.Empty)
[51] () (Ceta.Branch Ceta.B Ceta.Empty [52] () Ceta.Empty))
[53] () (Ceta.Branch Ceta.B Ceta.Empty [54] () Ceta.Empty)))
(Ceta.RBT (Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.B Ceta.Empty ((Ceta.UnLab [48],Ceta.Nat 0),Ceta.Nat 1)
[Ceta.TA_rule_impl
(Ceta.UnLab [48],Ceta.Nat 0) [[49]] [49]
(Ceta.RBT (Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.B
(Ceta.Branch
Ceta.B Ceta.Empty [49] ()
(Ceta.Branch
Ceta.R Ceta.Empty [49,50,52] () Ceta.Empty))
...
([49,56,49,57],[56,55,54]),([50,48,48,49],[51,51,48]),
([49,48,50,52],[50,51,50,57]),([49,48,50,52],[50,52,50,55]),
([49,55,51,49],[56,51,53]),([56,56,51],[56,55,52]),
([49,57,54],[49,56,53]),([49,57,54],[54,50,55]),
([50,50,50,55],[55,52,55]),([49,49,49,53],[50,53,55]),
([52,51,53],[52,50,53]),([50,52,54,57],[55,54,49])]
_ _
and this looks like the states of the automaton are represented as list-of-char, e.g.. [49,50,52]
(means 124
)
Oh yes indeed, state=string
, http://cl2-informatik.uibk.ac.at/rewriting/mercurial.cgi/IsaFoR/file/c41d7d152c37/xml/cpf.xsd#l1420
1420 <xs:element name="state" type="xs:string">
1421 <xs:annotation>
1422 <xs:documentation>the state of a tree automation</xs:documentation>
1423 </xs:annotation>
1424 </xs:element>
1425 <xs:element name="treeAutomaton">
1426 <xs:complexType>
1427 <xs:sequence>
1428 <xs:element name="finalStates">
1429 <xs:complexType>
1430 <xs:sequence>
1431 <xs:element maxOccurs="unbounded" minOccurs="0" ref="state"/>
1432 </xs:sequence>
1433 </xs:complexType>
1434 </xs:element>
...
Well this makes it expensive.
I implemented CPF output of full matchbounds for matchbox (I mean, it's already in the name, so ..). Encoding of state to string is configurable. The following
helps to reduce run-time of ceta by 25 percent.
; mb = Apply (Worker (Matchbound { method = Full
, cpf_state_encoding = Encoding {
-- test cases (timing given after config line)
-- Zantema_04/z097 (full mb 4 with 885 states)
-- / Secret_05_SRS/jambox3 (full mb 8 with 1843 states)
-- base = 10 , order = MSB_First -- 7.7 seconds / 310 s
-- base = 10 , order = LSB_First -- 6.4 seconds
-- base = 64 , order = MSB_First -- 7.0 seconds
base = 64 , order = LSB_First -- 5.6 seconds / 230 s
-- for fun:
-- base = 2 , order = MSB_First -- 11 s
-- base = 2 , order = LSB_First -- 10 s
}
} )) done
With ceta-states-integer https://lists.rwth-aachen.de/hyperkitty/list/termtools@lists.rwth-aachen.de/message/Y65SMRMZLX42LIUYRZRX3TGYT73X64YY/
I am getting
starexec-result="YES"
certification-result="CERTIFIED"
certification-time="374.3"
output-size="242320"
it's an improvement (15 percent). The profile is
ceta-postproc-integer-new +RTS -p -M60g -A30g -G1 -Sstderr -RTS data/bound/3633.cpf.ttt2~ data/bound/3633.xml
total time = 565.60 secs (565595 ticks @ 1000 us, 1 processor)
total alloc = 1,859,816,356,672 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
balance Ceta ceta-states-integer/new/Ceta.hs:2209:1-7 41.7 66.0
certify_proof_$s$srbt_ins12 Ceta ceta-states-integer/new/Ceta.hs:10180:1-7 40.8 22.2
rm_iterateoi Ceta ceta-states-integer/new/Ceta.hs:12743:1-12 9.4 7.1
certify_proof1654 Ceta <no location info> 6.5 4.3
good - comparator_list
is gone. Now RB operations dominate everything.
To see the call chains, I made a profile without optimisation,(else, functions might get inlined away). It seems that this is expensive:
check_bounds_generic.\.\.\.\.\.\.\.\ Ceta ceta-states-integer/new/Ceta.hs:(37199,12)-(37237,27) 4964 0 0.0 0.0 94.0 94.1
state_compatible_eff_list Ceta ceta-states-integer/new/Ceta.hs:(36654,1)-(36659,29) 4965 0 0.0 0.0 94.0 94.1
state_compatible_eff_list.\ Ceta ceta-states-integer/new/Ceta.hs:36658:28-71 5228 1027 0.0 0.0 0.0 0.0
catch_errora Ceta ceta-states-integer/new/Ceta.hs:(13800,1)-(13803,21) 5229 1027 0.0 0.0 0.0 0.0
state_compatible_eff_list.check Ceta ceta-states-integer/new/Ceta.hs:36656:5-49 5230 0 0.0 0.0 94.0 94.1
rule_state_compatible_eff_list Ceta ceta-states-integer/new/Ceta.hs:(36623,1)-(36645,32) 5231 0 0.0 0.0 94.0 94.1
rule_state_compatible_eff_list.\ Ceta ceta-states-integer/new/Ceta.hs:(36632,10)-(36645,31) 5232 1027 0.0 0.0 92.4 93.1
...
ta_match_impl.ep Ceta ceta-states-integer/new/Ceta.hs:36571:5-29 5306 118530 0.0 0.0 91.5 92.5
rs_Union Ceta ceta-states-integer/new/Ceta.hs:36469:1-77 5307 118530 0.0 0.0 91.5 92.5
foldl Ceta ceta-states-integer/new/Ceta.hs:(3150,1)-(3151,39) 5308 1516702 0.0 0.0 91.5 92.5
g_union_dflt_basic_oops_rm_basic_ops Ceta ceta-states-integer/new/Ceta.hs:(36464,1)-(36466,23) 5309 1398172 0.0 0.0 91.5 92.5
iteratei_bset_op_list_it_dflt_basic_oops_rm_basic_ops Ceta ceta-states-integer/new/Ceta.hs:(28195,1)-(28196,49) 5310 1398172 0.0 0.0 91.5 92.5
iteratei_bset_op_list_it_dflt_basic_oops_rm_basic_ops.\ Ceta ceta-states-integer/new/Ceta.hs:28196:13-48 5311 1398172 0.0 0.0 91.5 92.5
rm_iterateoi Ceta ceta-states-integer/new/Ceta.hs:(12743,1)-(12750,15) 5312 2032465262 8.1 5.6 91.5 92.5
g_union_dflt_basic_oops_rm_basic_ops.\ Ceta ceta-states-integer/new/Ceta.hs:36465:68-71 5322 2031067090 0.0 0.0 0.0 0.0
ins_rm_basic_ops Ceta ceta-states-integer/new/Ceta.hs:20289:1-36 5323 1015533545 0.3 0.0 82.7 86.9
insert Ceta ceta-states-integer/new/Ceta.hs:10201:1-53 5324 1015533545 1.1 0.7 82.5 86.9
impl_of Ceta ceta-states-integer/new/Ceta.hs:10172:1-19 5329 1015533545 0.0 0.0 0.0 0.0
rbt_insert Ceta ceta-states-integer/new/Ceta.hs:10198:1-49 5325 1015533545 0.6 1.2 81.3 86.2
rbt_insert_with_key Ceta ceta-states-integer/new/Ceta.hs:10195:1-55 5326 1015533545 1.4 2.4 80.7 84.9
rbt_ins Ceta ceta-states-integer/new/Ceta.hs:(10180,1)-(10190,46) 5328 13031651269 20.7 36.0 78.5 81.0
balance Ceta ceta-states-integer/new/Ceta.hs:(2209,1)-(2287,26) 5337 8146796632 15.7 45.0 15.7 45.0
compare Ceta ceta-states-integer/new/Ceta.hs:7266:3-27 5332 0 6.9 0.0 42.2 0.0
compare_integer Ceta ceta-states-integer/new/Ceta.hs:7263:1-31 5333 0 2.6 0.0 35.2 0.0
comparator_of Ceta ceta-states-integer/new/Ceta.hs:932:1-75 5334 12018216912 19.1 0.0 32.6 0.0
less Ceta ceta-states-integer/new/Ceta.hs:634:3-25 5335 0 2.5 0.0 13.5 0.0
less.\ Ceta ceta-states-integer/new/Ceta.hs:634:20-24 5336 12018216912 11.0 0.0 11.0 0.0
paint Ceta ceta-states-integer/new/Ceta.hs:(2307,1)-(2308,46) 5327 1015533545 0.4 1.5 0.4 1.5
compare Ceta ceta-states-integer/new/Ceta.hs:7266:3-27 5331 0 0.4 0.0 0.4 0.0
rm_iterateoi.sigmaa Ceta ceta-states-integer/new/Ceta.hs:12747:12-44 5330 1015533545 0.7 0.0 0.7 0.0
The code seems to be http://cl2-informatik.uibk.ac.at/rewriting/mercurial.cgi/IsaFoR/file/f77808a4258d/thys/Tree_Automata/Tree_Automata_Impl.thy#l367
fun ta_match_impl :: "('f \<times> nat,('q :: linorder,'f :: linorder)ta_rule_impl list)rm \<Rightarrow> 'q rs \<Rightarrow> ('q \<Rightarrow> 'q rs) \<Rightarrow> ('f,'v)term \<Rightarrow> 'q list \<Rightarrow> ('v :: linorder \<times> 'q)list rs"
where "ta_match_impl TA Qsig eps (Var x) Q = rs.from_list (map (\<lambda>q'. [(x,q')]) (rs.to_list (rs.inter (rs_Union (map eps Q)) Qsig)))"
| "ta_match_impl TA Qsig eps (Fun f ts) Q = (let
n = length ts;
rules = rm_set_lookup TA (f, n);
ep = rs_Union (map eps Q);
f = (\<lambda> rule. rs.from_list (case rule of TA_rule_impl _ qs q' qs' \<Rightarrow>
(if rs.memb q' ep then
(let rec = map (\<lambda> (tsi,qsi). rs.to_list (ta_match_impl TA Qsig eps tsi [qsi])) (zip ts qs)
in map concat (concat_lists rec))
else [])))
in rs_Union (map f rules))"
in Haskell:
ta_match_impl ::
forall a b c.
(Compare_order a, Compare_order b, Eq b, Compare_order c,
Eq c) => Rbt (a, Nat) [Ta_rule_impl b a] ->
Rbt b () ->
(b -> Rbt b ()) -> Term a c -> [b] -> Rbt [(c, b)] ();
ta_match_impl ta qsig eps (Var x) q =
g_from_list_dflt_basic_oops_rm_basic_ops
(map (\ qa -> [(x, qa)])
(g_to_list_dflt_basic_oops_rm_basic_ops
(g_inter_dflt_basic_oops_rm_basic_ops (rs_Union (map eps q)) qsig)));
ta_match_impl ta qsig eps (Fun f ts) q =
let {
n = size_list ts;
rules = rm_set_lookup ta (f, n);
ep = rs_Union (map eps q);
fa = (\ rule ->
g_from_list_dflt_basic_oops_rm_basic_ops
(case rule of {
TA_rule_impl _ qs qa _ ->
(if memb_rm_basic_ops qa ep
then let {
rec = map (\ (tsi, qsi) ->
g_to_list_dflt_basic_oops_rm_basic_ops
(ta_match_impl ta qsig eps tsi [qsi]))
(zip ts qs);
} in map concat (concat_lists rec)
else []);
}));
} in rs_Union (map fa rules);
ICFP/3633 : ttt2 produces compatible automaton with 2000 states, ceta needs 25 min certification time on starexec.