waylonflinn commented 9 years ago

bquery doesn't seem to support python 3. When I try to build with python 3, I get the following error:

NameError: name 'execfile' is not defined

waylonflinn commented 9 years ago

I'm working on a PR for this.

I have it compiling, installing and running, with most tests passing but I get this error when doing a groupby on a string column:

Exception ignored in: 'bquery.ctable_ext._factorize_str_helper'
TypeError: expected bytes, numpy.str_ found

Any suggestions about cause or solutions?

CarstVaartjes commented 9 years ago

@waylonflinn thanks for the interest and help! We definitely should support Python 3, so you've found a bit of a gap here oops! @FrancescElies I think this is an issue with the test right? (the difference between python 2 and 3 in terms of str vs unicode while numpy stays string?)

waylonflinn commented 9 years ago

Glad to hear you guys are interested in supporting Python 3!

I'll go ahead and submit the (unfinished) PR, so you guys can take a look at it and offer suggestions, if you have time.

waylonflinn commented 9 years ago

PR #40 submitted

FrancescElies commented 9 years ago

I just replied inside PR #40 to make a small modification to travis conf to make this error visible

waylonflinn commented 9 years ago

Small update.

The error above:

Exception ignored in: 'bquery.ctable_ext._factorize_str_helper'
TypeError: expected bytes, numpy.str_ found

was related to using the numpy unicode type (U) instead of the numpy string type (S) for strings in the bcolz datastore. The workaround for this was to use S instead of U when declaring the dtypes.

The failed unit test appears to be related to strings (as bytes) not sorting properly in Python 3 when using the sorted_count_distinct aggregation method.

waylonflinn commented 9 years ago

Failing test results for test_groupby_09: Groupby's type 'sorted_count_distinct'

#--> Bcolz
[(b'a', 97.0, 215, 319) (b'c', 99.0, 222, 344) (b'b', 100.0, 225, 337)]
#--> Itertools
[[b'a', 97, 215, 319], [b'b', 100, 225, 337], [b'c', 99, 222, 344]]

The values are correct, but the ordering is different. Itertools results appear to be sorted by the string key ('a', 'b', 'c'). Bcolz results are not ('a', 'c', 'b'). I've spent some time with the code and I haven't figured out why (or whether this is important).

I have a few questions:

What is sorted_count_distinct supposed to be doing?
What are the constraints on the input to sorted_count_distinct (ctable.py says "data should have been previously presorted")?
Is this an actual deviation from expected behaviour, or an artifact of the test?

CarstVaartjes commented 9 years ago

Hi!

A short mail from my phone: the sorting of the result should not matter. The sorted count distinct is a heavily optimized count distinct operation that assumes that a value in a column might repeat directly in sequence, but not separately later anymore. So value x might be in row 1,2,3 and then followed by value y in row 4. We assume here that value x will not occur anymore down the road. This means that if you presort a table you can do very optimized count distinct operations, we use this a lot for retail analytics (basket counts where product a and b were present for instance). You could also use it for other industries (how many patients had x and y) So the input of the column that you do the count distinct on needs to be sorted. How it outputs it not necessarily though it is a bit weird that Python 2 does this differently

Br. Carst

Sent from Outlookhttp://taps.io/outlookmobile

On Wed, Jun 10, 2015 at 9:35 AM -0700, "Waylon Flinn" notifications@github.com<mailto:notifications@github.com> wrote:

Failing test results for test_groupby_09: Groupby's type 'sorted_count_distinct'

--> Bcolz

[(b'a', 97.0, 215, 319) (b'c', 99.0, 222, 344) (b'b', 100.0, 225, 337)]

--> Itertools

[[b'a', 97, 215, 319], [b'b', 100, 225, 337], [b'c', 99, 222, 344]]

The values are correct, but the ordering is different. Itertools results appear to be sorted by the string key ('a', 'b', 'c'). Bcolz results are not ('a', 'c', 'b'). I've spent some time with the code and I haven't figured out why (or whether this is important).

I have a few questions:

What is sorted_count_distinct supposed to be doing?
What are the constraints on the input to sorted_count_distinct (ctable.py says "data should have been previously presorted")?
Is this an actual deviation from expected behaviour, or an artifact of the test?

Reply to this email directly or view it on GitHubhttps://github.com/visualfabriq/bquery/issues/39#issuecomment-110825251.

This email (including any attachments to it) is confidential, legally privileged, subject to copyright and is sent for the personal attention of the intended recipient only. If you have received this email in error, please advise us immediately and delete it. You are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited. Although we have taken reasonable precautions to ensure no viruses are present in this email, we cannot accept responsibility for any loss or damage arising from the viruses in this email or attachments. We exclude any liability for the content of this email, or for the consequences of any actions taken on the basis of the information provided in this email or its attachments, unless that information is subsequently confirmed in writing.

waylonflinn commented 9 years ago

I'm also a bit confused and intrigued by the fact that this fails on python 3 but not python 2. I'm currently wondering if this could have something to do with differences in random number generation between the two versions. (Possibly related: Why is seeding the random generator not stable between versions of Python? )

As a potential quick fix, I notice that several of the other tests sort their output before doing the assert. Does that seem like a reasonable thing to try here as well?

waylonflinn commented 9 years ago

Sorting the input appears to fix the broken test.

CarstVaartjes commented 9 years ago

That's because the count mechanism takes up unique values in order. Either way to solve the test is okay, because we're not checking the order really but the numerical part. so if that makes it pass,that's perfect

CarstVaartjes commented 9 years ago

I've merged the code, let me know if it's okay to close

waylonflinn commented 9 years ago

@CarstVaartjes thanks! looks good to me!

CarstVaartjes commented 9 years ago

(let me know what you want in the release notes :)

waylonflinn commented 9 years ago

maybe something like this: "support for Python 3 added by Waylon Flinn (@waylonflinn)"

modify it however you want. glad to help :)

visualfabriq / bquery

Python 3 Support #39

--> Bcolz

--> Itertools