Closed ngsankha closed 2 years ago
Constants are handled as of commit f9611a4a72877e34e3214cead2fdcf75ff7bbfd7.
The rows/columns index domain is not yet enabled. Enabling it might help eliminate more programs through abstract interpretation.
Intermediate benchmarking results after enabling constants, running the tool on each benchmark with a 20-minute timeout (the same as in the AutoPandas paper). We solve 17/29 benchmarks vs. 17/26 in the original paper. I expect the numbers to improve once the rows/columns domain is enabled.
SO_49581206_depth3
------------------
ERROR!
SO_13576164_depth3
------------------
ERROR!
SO_14023037_depth3
------------------
ERROR!
SO_23321300_depth3
------------------
ERROR!
SO_13807758_depth2
------------------
arg0.dropna().reset_index(drop=True)
48.4200041539998
SO_49567723_depth2
------------------
ERROR!
SO_11811392_depth3
------------------
arg0.T.reset_index().values
4.165910928999438
SO_10982266_depth3
------------------
ERROR!
SO_18172851_depth1
------------------
arg0.loc[arg1]
1.152512497999851
SO_49987108_depth2
------------------
ERROR!
SO_49583055_depth1
------------------
arg0.sort_values(by=["ID"])
4.137355733999357
SO_49583055_depth1
------------------
arg0.sort_values(by=["ID"])
4.092509506999704
SO_49572546_depth1
------------------
arg1.combine_first(arg0)
2.0150949630005925
SO_39656670_depth3
------------------
ERROR!
SO_11881165_depth1
------------------
arg0.loc[[0, 2, 4]]
1.057938928999647
SO_11941492_depth1
------------------
ERROR!
SO_21982987_depth3
------------------
ERROR!
SO_53762029_depth3
------------------
arg0.pivot_table(index=["doc_created_month", "doc_created_year", "speciality"]).cumsum()
302.4615383049986
SO_13261691_depth2
------------------
arg0.stack().unstack()
50.20894425500046
SO_12065885_depth3
------------------
arg0.loc[[2, 4, 6]]
1.0424643039987131
SO_13261175_depth1
------------------
arg0.pivot_table(values="value", index="name", columns=["type", "date"])
276.5202395289998
SO_13659881_depth2
------------------
arg0.groupby(["ip", "useragent"]).size()
1.0170146630007366
SO_14085517_depth1
------------------
arg0.sort_values(by=["SEGM1"])
195.83883650999996
SO_34365578_depth2
------------------
ERROR!
SO_11418192_depth2
------------------
arg0.query(arg2)
0.9639060649988096
SO_49592930_depth1
------------------
arg0.combine_first(arg1)
0.9801300960007211
SO_13793321_depth1
------------------
arg0.merge(arg1, on=10)
5.554725879999751
SO_13647222_depth1
------------------
ERROR!
SO_12860421_depth1
------------------
arg0.pivot_table(values="Z", index="Y", columns="Z", aggfunc=pd.Series.nunique)
683.4204104500004
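For anyone who wants to re-check the tally, here is a minimal sketch of a script that counts solved entries in a log with the layout above (name, dashed separator, then either `ERROR!` or a program followed by a time). The function name `tally` and the embedded three-entry excerpt are illustrative assumptions, not part of the tool.

```python
def tally(log_text):
    """Return (solved, total) for a log in the layout shown above.

    Assumed layout per entry: benchmark name, a line of dashes, then
    either "ERROR!" or the synthesized program (followed by a time).
    """
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    entries = []
    for i, line in enumerate(lines):
        # A separator line consists only of dashes; the name precedes it
        # and the outcome follows it.
        if set(line) == {"-"} and 0 < i < len(lines) - 1:
            name, outcome = lines[i - 1], lines[i + 1]
            entries.append((name, outcome != "ERROR!"))
    solved = sum(1 for _, ok in entries if ok)
    return solved, len(entries)


# Three entries copied from the log above, used only as sample input.
SAMPLE = """
SO_49581206_depth3
------------------
ERROR!
SO_18172851_depth1
------------------
arg0.loc[arg1]
1.152512497999851
SO_49572546_depth1
------------------
arg1.combine_first(arg0)
2.0150949630005925
"""

solved, total = tally(SAMPLE)
print(f"{solved}/{total}")
```

Entries are kept in a list rather than a dict so that a benchmark listed twice (as `SO_49583055_depth1` is above) is counted twice, matching how the 29 total is reached.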
This is implemented now!
The one in the code base is disabled right now. A proper pandas rows/columns domain needs to track the following information:
IndexKind
: There are various kinds of possible indexes in pandas. Two domains are comparable if they have the same index kind.
IndexLabels
: A set of index labels. These could be numeric, datetime, strings, or even tuples. This behaves like a mathematical set.

Implementing this correctly should give us most constant strings automatically, which are now supplied upfront.
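A rough sketch of what such a domain could look like, written in plain Python for illustration. None of this is the code in the repository: the enum members, the `IndexDomain` class, and the `may_produce` helper are all hypothetical names, and the ordering (same kind plus label-set inclusion) is just one plausible reading of the two items above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class IndexKind(Enum):
    # Hypothetical, incomplete list of index kinds (assumption).
    INTEGER = auto()
    STRING = auto()
    DATETIME = auto()
    MULTI = auto()


@dataclass(frozen=True)
class IndexDomain:
    kind: IndexKind
    labels: frozenset  # behaves like a mathematical set of labels

    def comparable(self, other):
        # Two domains are comparable if they have the same index kind.
        return self.kind == other.kind

    def leq(self, other):
        # Partial order: same kind and labels contained in other's labels.
        return self.comparable(other) and self.labels <= other.labels

    def join(self, other):
        # Least upper bound of two comparable domains: union of labels.
        if not self.comparable(other):
            raise ValueError("cannot join domains of different index kinds")
        return IndexDomain(self.kind, self.labels | other.labels)


def may_produce(candidate, expected):
    """Pruning check (assumed usage): keep a partial program only if the
    expected output's labels fit inside what the candidate can yield."""
    return expected.leq(candidate)


# A row selection such as arg0.loc[[0, 2, 4]] can only yield row labels
# that already exist in arg0, so arg0's row labels bound the result.
input_rows = IndexDomain(IndexKind.INTEGER, frozenset(range(6)))
wanted_rows = IndexDomain(IndexKind.INTEGER, frozenset({0, 2, 4}))
impossible_rows = IndexDomain(IndexKind.INTEGER, frozenset({0, 7}))

print(may_produce(input_rows, wanted_rows))      # True
print(may_produce(input_rows, impossible_rows))  # False
```

Under this reading, string labels that appear in the expected output's index (for example column names such as `"ID"`) would be available from the abstract value itself, which is how the upfront constants might become unnecessary.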