Closed VelizarVESSELINOV closed 5 years ago
Can you try on master? There were a few issues with quantile in 0.25.0 and 0.25.1 that will be fixed in the next minor release
Can you provide a reproducible example? Here is a good reference for making bug reports: https://blog.dask.org/2018/02/28/minimal-bug-reports
@WillAyd, when we switched from quantile to mean, it worked much better, so we suspect a quantile issue. We will test with 0.25.2 when it is available by switching back from mean to quantile, and will let you know if there is still an issue. Thanks.
@dsaxton it is difficult to write a "good" stable example reproducing the MALLOC_LARGE issue :) It is not 100% reproducible even with the same datasets. The dataset that triggers the malloc issue most often has more than 4M rows and a group by of ~30K intervals.
Ran into the same issue today. The problem is with None keys. Here is some C&P'able code:
import pandas
print(pandas.__version__)
df = pandas.DataFrame(data=[['A', 0], [None, 123], ['A', 0]], columns=['key', 'value'])
result = df.groupby('key').quantile()
print(result)
This works fine with pandas 0.24.2, but with 0.25.1 there are several issues:
None as an additional category would also be acceptable. But mixing it into one of the other groups is just wrong. (If you have multiple different groups, it will get mixed into whatever group comes first in the result.)
result is garbage collected. Other, slightly different variations may crash immediately. For example, adding as_index=False to the groupby will crash immediately.
Unfortunately, I cannot test this with the latest master branch, as I am on my business laptop where I cannot easily install the required C++ build tools. I will test this on another machine of mine, but it may be several hours before I get around to that.
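As a stopgap, one option is to drop the rows with a missing key before grouping, which reproduces the 0.24.2 result for the sample above. This is only a sketch: that it sidesteps the crash on affected versions is an assumption, not something verified in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["A", 0], [None, 123], ["A", 0]],
    columns=["key", "value"],
)

# Remove rows whose grouping key is missing so that no row reaches
# groupby without a valid group (assumed workaround, untested on 0.25.x).
clean = df.dropna(subset=["key"])

result = clean.groupby("key").quantile()
print(result)
```

The row with value 123 is excluded, so group A contains only the two zero values.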
I got it installed. Here are results for pandas 0.26.0.dev0+583.g86e187f:
as_index=False will crash immediately. Using a list of quantiles (e.g. df.groupby('key').quantile([0, 0.5, 1])) crashes immediately. Using the similar example from the other thread on this topic crashes immediately.
Confirmed: the same crash occurs with the official 0.25.2 release:
dtf = dtf.groupby(cut(dtf[col[0]], rng, duplicates='drop', precision=4)).quantile(.5) # Crash
dtf = dtf.groupby(cut(dtf[col[0]], rng, duplicates='drop', precision=4)).mean() # OK
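A possible link between the cut-based snippet and the None-key example: pd.cut assigns NaN to any value that falls outside the supplied bins, so the resulting grouper can contain missing keys as well. Whether the original dataset actually had out-of-range values is an assumption; the bins and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical values and bin edges; 20 lies outside the single bin (0, 10].
values = [1, 5, 20]
binned = pd.cut(values, bins=[0, 10])

# Out-of-range values become NaN, which the categorical stores as code -1.
print(binned)
print(binned.codes)
```

Checking binned.isna().any() before grouping is a quick way to tell whether a cut-based groupby has missing keys.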
This segfaults for me ~10% of the time on master
import pandas
print(pandas.__version__)
df = pandas.DataFrame(data=[['A', 0], [None, 123], ['A', 0]], columns=['key', 'value'])
result = df.groupby('key').quantile([0.25, 0.75])
If this was working on 0.24 but not 0.25, it probably comes back to #20405
I'll see what I can find
Are you able to reproduce locally Will? If not, I can spend some time digging into it.
Yea - the last sample you posted was great and I see the same behaviour
Believe I have figured it out. The problem is on this line:
When NA values appear in the groupby they are indicated with a -1 label. However, via decorators we disabled Cython.wraparound for negative indexing AND also Cython.boundscheck to detect index errors. The combination of all of those factors leads to a segfault.
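The -1 label can be seen from plain Python with pd.factorize, which encodes missing values the same way. The sketch below is not the pandas Cython code and not the eventual fix; it only illustrates the idea of skipping negative labels before they are used as indices into a per-group array.

```python
import numpy as np
import pandas as pd

keys = pd.Series(["A", None, "A"])

# Missing keys get the sentinel label -1; only "A" is a real group.
codes, uniques = pd.factorize(keys)
print(codes)

# One slot per real group. Using -1 as an index here would wrap around in
# NumPy; with wraparound and boundscheck both disabled in Cython it would
# instead touch memory outside the buffer.
counts = np.zeros(len(uniques), dtype=np.int64)

# Guard: drop the -1 labels before indexing.
valid = codes >= 0
np.add.at(counts, codes[valid], 1)
print(counts)
```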
I'll push up a PR soon. Just thinking through test case(s)
Code Sample, a copy-pastable example if possible
Problem description
Process: Python [24642]
Path: /Library/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python
Identifier: Python
Version: 3.7.4 (3.7.4)
Code Type: X86-64 (Native)
Parent Process: Python [24593]
Responsible: iTerm2 [1703]
User ID: 501
Date/Time: 2019-10-09 17:11:04.949 -0500
OS Version: Mac OS X 10.15 (19A583)
Report Version: 12
Bridge OS Version: 3.0 (14Y904)
Anonymous UUID: F986CCB3-5DD1-9587-8492-6D8B8A43979D
Sleep/Wake UUID: 42F77302-9822-4979-89CB-7C39F3C0556A
Time Awake Since Boot: 67000 seconds
Time Since Wake: 1900 seconds
System Integrity Protection: enabled
Crashed Thread: 7
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x000000013a6dcff8
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [24642]
VM Regions Near 0x13a6dcff8:
MALLOC_LARGE 000000013a5ee000-000000013a633000 [ 276K] rw-/rwx SM=PRV
--> MALLOC_LARGE 000000013a6dd000-000000013a74d000 [ 448K] rw-/rwx SM=PRV
0 groupby.cpython-37m-darwin.so 0x000000011b1596cf pyx_fuse_9pyx_pw_6pandas_5_libs_7groupby_125group_quantile + 6719
1 algos.cpython-37m-darwin.so 0x0000000119e8937c __pyx_FusedFunction_call + 812
Expected Output
Output of pd.show_versions()