queryverse / Query.jl

Query almost anything in julia

@filter gets stuck #240

Open ezgikurt opened 5 years ago

ezgikurt commented 5 years ago

Hi, I have this issue in both Julia 1.0.3 and 1.1.0 (I'm using Windows 10). I load a .dta file into a DataFrame and then try filtering it with the @filter macro, but it hangs indefinitely unless I press Ctrl+C and force an exit. The same df works fine in 0.6.4, but loading the data takes much longer. I know that these are experimental functionalities; are they not available on Julia 1.x? @map seems to work fine, for example (using it on data loaded from CSV).

davidanthoff commented 5 years ago

They are no longer experimental and should just work, so this certainly appears to be a bug.

Could you post the exact code you are using?

ezgikurt commented 5 years ago

Sure, here is the code I'm currently trying out. Normally I use multiple conditions etc., but even this is not working.

using Queryverse
df = DataFrame(load("./data/ibes_adj.dta"));
df |> @filter(_.ticker == "AAPL")

It gets stuck on the last line. I've waited upwards of 20 minutes. The same issue occurs if I instead do using Query, DataFrames, StatFiles.

However, I have an interesting finding. I also have the same dataset in .csv format, and when I load it from the CSV, @filter works. typeof(df) returns DataFrame in both instances (loading from a .csv takes much longer than loading from .dta, though).

The same code works perfectly fine in 0.6.4.

P.S. @davidanthoff, by the way, thanks for the intro to Queryverse video; it completely changed how I approach this particular dataset!

davidanthoff commented 5 years ago

Any chance that you could share the file? If you can't upload it here for some reason, maybe you could email it to me? My email is on my profile page at https://github.com/davidanthoff.

ezgikurt commented 5 years ago

@davidanthoff I've sent it via email.

davidanthoff commented 5 years ago

Thanks, got the file! It is quite mysterious what is happening there. As far as I can tell, the issue is somewhere in Tables.jl and unrelated to Query.jl, but I don't yet fully understand what the problem is. At least that is what I think right now, given that df |> getiterator |> collect also hangs forever, and that should not exercise any Query.jl-related code.

One way around this is to just skip the DataFrame materialization entirely and thereby avoid the code path through Tables.jl: df = load("./data/ibes_adj.dta") |> @filter(_.ticker == "AAPL") works for me.
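For anyone who still needs a DataFrame downstream, here is a minimal sketch of the same idea: run the filter on the source first and materialize a DataFrame only from the filtered result. The in-memory rows and the eps column are made up for illustration and stand in for the loaded .dta file. Whether this ordering avoids the hang for the file in this issue has not been verified.

```julia
using Query, DataFrames

# Hypothetical stand-in for load("./data/ibes_adj.dta"):
# a vector of named tuples is used here as the query source.
rows = [
    (ticker = "AAPL", eps = 1.2),
    (ticker = "MSFT", eps = 0.9),
    (ticker = "AAPL", eps = 1.4),
]

# Filter first, then collect only the matching rows into a DataFrame.
aapl = rows |> @filter(_.ticker == "AAPL") |> DataFrame
```

The result is an ordinary DataFrame holding only the matching rows, so code further down the script that expects a DataFrame should be able to use it unchanged.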

ezgikurt commented 5 years ago

Hi @davidanthoff, thanks for the reply, and sorry for the late comment. Unfortunately I am relying on DataFrames too much in other parts of the script to be able to skip it, but I can use the .csv file for now.

I'd also like to add that this issue might be specific to that particular dataset, because I have another .dta that is larger (190k rows instead of 140k) and does not have the same issue.

JonasIsensee commented 5 years ago

I can reproduce this error with a DataFrame of my own as well. Both LINQ-style and stand-alone operators hang (all that I have tried so far).

I initially thought it might be due to some funky column types, but when I converted all columns to Vector{Any} I got a completely different error on a query:

julia> df = DataFrame(a = Any[rand(5)...], b=Any[rand(5)...])
5×2 DataFrame
│ Row │ a         │ b         │
│     │ Any       │ Any       │
├─────┼───────────┼───────────┤
│ 1   │ 0.481095  │ 0.754002  │
│ 2   │ 0.986754  │ 0.65757   │
│ 3   │ 0.387108  │ 0.318604  │
│ 4   │ 0.0094196 │ 0.920775  │
│ 5   │ 0.283106  │ 0.0891282 │

julia> df |> @filter(_.a > 0.3) |> DataFrame
ERROR: UndefVarError: DataValueUnwrapper not defined
Stacktrace:
 [1] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Array{NamedTuple{(:a, :b),Tuple{Any,Any}},1}}},getfield(Main, Symbol("##38#40"))}) at /home/jonas/.julia/packages/DataFrames/z2XOB/src/other/tables.jl:27
 [2] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Array{NamedTuple{(:a, :b),Tuple{Any,Any}},1}}},getfield(Main, Symbol("##38#40"))}, ::Type) at ./operators.jl:813
 [3] top-level scope at none:0

EDIT: the same error shows up without converting to Vector{Any}.

julia> df2 = DataFrame(a = rand(5), b=rand(5))
5×2 DataFrame
│ Row │ a        │ b         │
│     │ Float64  │ Float64   │
├─────┼──────────┼───────────┤
│ 1   │ 0.726983 │ 0.214645  │
│ 2   │ 0.201247 │ 0.848875  │
│ 3   │ 0.344053 │ 0.374592  │
│ 4   │ 0.367191 │ 0.710325  │
│ 5   │ 0.231486 │ 0.0913907 │

julia> df2 |> @filter(_.a > 0.3) |> DataFrame
ERROR: UndefVarError: DataValueUnwrapper not defined
Stacktrace:
 [1] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Float64,Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Array{NamedTuple{(:a, :b),Tuple{Float64,Float64}},1}}},getfield(Main, Symbol("##66#68"))}) at /home/jonas/.julia/packages/DataFrames/z2XOB/src/other/tables.jl:27
 [2] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Float64,Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Array{NamedTuple{(:a, :b),Tuple{Float64,Float64}},1}}},getfield(Main, Symbol("##66#68"))}, ::Type) at ./operators.jl:813
 [3] top-level scope at none:0

davidanthoff commented 5 years ago

@JonasIsensee I think what you are seeing must be a different issue. What versions of packages are you using? I can't reproduce your latest code example.

JonasIsensee commented 5 years ago

julia v1.0.3
Query v0.11.0
DataFrames v0.16.0
Tables v0.1.15
TableTraits v0.4.1
QueryOperators v0.7.0
IterableTables v0.10.0
DataValues v0.4.7

There's got to be a better way to find these version numbers; I went through my Manifest by hand. Anyway, another ] up seems to have resolved that issue: converting all columns to Vector{Any} works now. Queries on the original DataFrame still cause Julia to hang and steadily increase memory allocation.
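On finding version numbers without reading the Manifest by hand: a minimal sketch using the Pkg standard library. It assumes Julia 1.4 or later, where Pkg.dependencies() is available; on Julia 1.0, running ] st -m at the REPL should list the manifest versions instead. The helper name manifest_versions is made up for illustration.

```julia
using Pkg

# Collect (name, version) pairs for every package in the active
# environment's manifest, sorted by package name.
# Assumes Julia >= 1.4, where Pkg.dependencies() exists.
function manifest_versions()
    deps = Pkg.dependencies()
    pairs = [(info.name, info.version) for info in values(deps)]
    sort!(pairs; by = first)
    return pairs
end

for (name, version) in manifest_versions()
    # Some entries (for example standard libraries) may have no version recorded.
    suffix = version === nothing ? "" : " v$version"
    println(name, suffix)
end
```

This prints both direct and indirect dependencies, which is what is usually wanted when reporting versions in an issue.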