Closed waylonflinn closed 9 years ago
I'm working on a PR for this.
I have it compiling, installing and running, with most tests passing but I get this error when doing a groupby on a string column:
Exception ignored in: 'bquery.ctable_ext._factorize_str_helper'
TypeError: expected bytes, numpy.str_ found
Any suggestions about cause or solutions?
@waylonflinn thanks for the interest and help! We definitely should support Python 3, so you've found a bit of a gap here, oops! @FrancescElies I think this is an issue with the test, right? (The difference between Python 2 and 3 in terms of str vs unicode, while numpy stays string?)
Glad to hear you guys are interested in supporting Python 3!
I'll go ahead and submit the (unfinished) PR, so you guys can take a look at it and offer suggestions, if you have time.
PR #40 submitted
I just replied inside PR #40 to make a small modification to travis conf to make this error visible
Small update.
The error above:
Exception ignored in: 'bquery.ctable_ext._factorize_str_helper'
TypeError: expected bytes, numpy.str_ found
was related to using the numpy unicode type (U) instead of the numpy string type (S) for strings in the bcolz datastore. The workaround for this was to use S instead of U when declaring the dtypes.
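A minimal sketch of the S versus U difference, assuming only NumPy under Python 3. The sample values are made up and nothing here calls bquery or bcolz; it only shows why an element of a U array is not accepted where bytes are expected.

```python
import numpy as np

# Made-up sample values; only the dtype differs between the two arrays.
as_bytes = np.array(['a', 'b', 'c'], dtype='S1')    # numpy string (bytes) type
as_unicode = np.array(['a', 'b', 'c'], dtype='U1')  # numpy unicode type

# Elements of an S array are numpy.bytes_, which subclasses bytes.
print(type(as_bytes[0]), isinstance(as_bytes[0], bytes))

# Elements of a U array are numpy.str_, which is not a bytes object,
# so code that expects bytes rejects them.
print(type(as_unicode[0]), isinstance(as_unicode[0], bytes))
```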
The failed unit test appears to be related to strings (as bytes) not sorting properly in Python 3 when using the sorted_count_distinct aggregation method.
Failing test results for test_groupby_09: Groupby's type 'sorted_count_distinct'
#--> Bcolz
[(b'a', 97.0, 215, 319) (b'c', 99.0, 222, 344) (b'b', 100.0, 225, 337)]
#--> Itertools
[[b'a', 97, 215, 319], [b'b', 100, 225, 337], [b'c', 99, 222, 344]]
The values are correct, but the ordering is different. Itertools results appear to be sorted by the string key ('a', 'b', 'c'). Bcolz results are not ('a', 'c', 'b'). I've spent some time with the code and I haven't figured out why (or whether this is important).
I have a few questions:
What is sorted_count_distinct supposed to be doing?
Does the data need to be sorted before using sorted_count_distinct (ctable.py says "data should have been previously presorted")?
Hi!
A short mail from my phone: the sorting of the result should not matter. The sorted count distinct is a heavily optimized count distinct operation that assumes that a value in a column might repeat directly in sequence, but will not appear again later. So value x might be in rows 1, 2 and 3, followed by value y in row 4; we assume here that value x will not occur anymore down the road. This means that if you presort a table you can do very optimized count distinct operations. We use this a lot for retail analytics (basket counts where products a and b were both present, for instance). You could also use it in other industries (how many patients had x and y). So the input of the column that you do the count distinct on needs to be sorted. The output does not necessarily need to be, though it is a bit weird that Python 2 does this differently.
Br. Carst
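The presorted assumption described above can be sketched in plain Python. This is only an illustration of the idea, not the bquery implementation; count_distinct_presorted is a hypothetical name and the sample values are made up.

```python
def count_distinct_presorted(values):
    """Count distinct values, assuming equal values are always adjacent."""
    count = 0
    previous = object()  # sentinel that compares unequal to every value
    for value in values:
        if value != previous:
            # A change of value is treated as a new distinct value.
            count += 1
            previous = value
    return count

presorted = [b'x', b'x', b'x', b'y']
not_sorted = [b'x', b'y', b'x']

print(count_distinct_presorted(presorted))   # matches len(set(presorted))
print(count_distinct_presorted(not_sorted))  # overcounts: b'x' is seen twice
```

Only one previous value is kept in memory, which is why the shortcut is cheap, and also why it overcounts when the input is not sorted.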
I'm also a bit confused and intrigued by the fact that this fails on python 3 but not python 2. I'm currently wondering if this could have something to do with differences in random number generation between the two versions. (Possibly related: Why is seeding the random generator not stable between versions of Python? )
As a potential quick fix, I notice that several of the other tests sort their output before doing the assert. Does that seem like a reasonable thing to try here as well?
Sorting the input appears to fix the broken test.
That's because the count mechanism takes up unique values in order. Either way of fixing the test is okay, because we're not really checking the order, only the numerical part. So if that makes it pass, that's perfect.
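An order-insensitive comparison could look roughly like the sketch below. The rows are copied from the failing test output earlier in this thread; the normalize helper is hypothetical and is not taken from the bquery test suite.

```python
# Rows copied from the failing test output in this thread.
bcolz_result = [(b'a', 97.0, 215, 319), (b'c', 99.0, 222, 344), (b'b', 100.0, 225, 337)]
itertools_result = [[b'a', 97, 215, 319], [b'b', 100, 225, 337], [b'c', 99, 222, 344]]

def normalize(rows):
    """Convert each row to a list and sort, so row order no longer matters."""
    return sorted(list(row) for row in rows)

# Same rows, different order: equal once both sides are sorted.
assert normalize(bcolz_result) == normalize(itertools_result)
```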
I've merged the code, let me know if it's okay to close
@CarstVaartjes thanks! looks good to me!
(let me know what you want in the release notes :)
maybe something like this: "support for Python 3 added by Waylon Flinn (@waylonflinn)"
modify it however you want. glad to help :)
bquery doesn't seem to support python 3. When I try to build with python 3, I get the following error:
NameError: name 'execfile' is not defined
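For context, execfile was removed in Python 3, and a common replacement is to read the file and pass its contents to exec. The sketch below is illustrative only: exec_file and version.py are hypothetical names, and this is not necessarily how PR #40 fixed the build.

```python
import os
import tempfile

def exec_file(path, namespace=None):
    """Python 3 stand-in for execfile(path, namespace)."""
    if namespace is None:
        namespace = {}
    with open(path) as handle:
        source = handle.read()
    # compile() keeps the file name in tracebacks, as execfile did.
    exec(compile(source, path, 'exec'), namespace)
    return namespace

# Usage with a throwaway file standing in for a version module.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'version.py')
    with open(demo_path, 'w') as handle:
        handle.write("__version__ = '0.0.0'\n")
    loaded = exec_file(demo_path)

print(loaded['__version__'])
```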