Open ezgikurt opened 5 years ago
They are no longer experimental and should just work, so this certainly appears to be a bug.
Could you post the exact code you are using?
Sure, here is the code I'm currently trying out. Normally I use multiple conditions etc., but even this does not work.
using Queryverse
df = DataFrame(load("./data/ibes_adj.dta"));
df |> @filter(_.ticker == "AAPL")
It gets stuck on the last line. I've waited upwards of 20 minutes. The same issue occurs if I do using Query, DataFrames, StatFiles.
However, I have an interesting finding. I also have the same dataset in .csv format, and when I load it from the CSV, @filter works. typeof(df) returns DataFrame in both instances (loading from the .csv takes much longer than loading from the .dta, though).
The same code works perfectly fine in 0.6.4.
ps. @davidanthoff btw thanks for the intro to queryverse video, it completely changed how i approach this particular dataset!
Any chance that you could share the file? If you can't upload it here for some reason, maybe you could email it to me? My email is on my profile page at https://github.com/davidanthoff.
@davidanthoff I've sent it via email.
Thanks, got the file! It is quite mysterious what is happening there. The issue is somewhere in Tables.jl, as far as I can tell, and unrelated to Query.jl, but I don't yet fully understand what the problem is (at least that is what I think right now, given that df |> getiterator |> collect also hangs forever, which should not exercise any Query.jl related stuff).
One way around this is to just skip the DataFrame materialization entirely and thereby avoid the code path through Tables.jl: df = load("./data/ibes_adj.dta") |> @filter(_.ticker == "AAPL") works for me.
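If later steps still need a DataFrame, one possible variation on the workaround above is to filter while the rows are still coming from the file and only then materialize the smaller result. This is an untested sketch, not something confirmed in this thread: it assumes Queryverse re-exports load, @filter and DataFrame as in the earlier snippet, and the final conversion may still hit the problematic code path.

```julia
using Queryverse  # assumed to provide load, @filter and DataFrame, as in the snippet above

# Filter the rows as they stream out of the file, then materialize
# only the filtered result as a DataFrame.
# The variable name aapl is illustrative only.
aapl = load("./data/ibes_adj.dta") |>
    @filter(_.ticker == "AAPL") |>
    DataFrame
```

If the final conversion hangs as well, that would suggest the problem lies in building or iterating the DataFrame rather than in the filter itself.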
Hi @davidanthoff, thanks for the reply & sorry for the late comment. Unfortunately I rely on DataFrames too much in other parts of the script to be able to skip it, but I can use the .csv file for now.
I'd also like to add that this issue might be specific to that dataset, because I have another .dta file that is larger (190k rows instead of 140k) and does not have the same issue.
I can reproduce this error with a DataFrame of my own as well. Both LINQ-style and stand-alone operators hang (all that I have tried so far).
I initially thought it might be due to some funky column types, but when I converted all columns to Vector{Any} I got a completely different error on a query:
julia> df = DataFrame(a = Any[rand(5)...], b=Any[rand(5)...])
5×2 DataFrame
│ Row │ a │ b │
│ │ Any │ Any │
├─────┼───────────┼───────────┤
│ 1 │ 0.481095 │ 0.754002 │
│ 2 │ 0.986754 │ 0.65757 │
│ 3 │ 0.387108 │ 0.318604 │
│ 4 │ 0.0094196 │ 0.920775 │
│ 5 │ 0.283106 │ 0.0891282 │
julia> df |> @filter(_.a > 0.3) |> DataFrame
ERROR: UndefVarError: DataValueUnwrapper not defined
Stacktrace:
[1] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Array{NamedTuple{(:a, :b),Tuple{Any,Any}},1}}},getfield(Main, Symbol("##38#40"))}) at /home/jonas/.julia/packages/DataFrames/z2XOB/src/other/tables.jl:27
[2] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{DataValues.DataValue{Any},DataValues.DataValue{Any}}},Array{NamedTuple{(:a, :b),Tuple{Any,Any}},1}}},getfield(Main, Symbol("##38#40"))}, ::Type) at ./operators.jl:813
[3] top-level scope at none:0
EDIT: the same error shows up without converting to Vector{Any}.
julia> df2 = DataFrame(a = rand(5), b=rand(5))
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.726983 │ 0.214645 │
│ 2 │ 0.201247 │ 0.848875 │
│ 3 │ 0.344053 │ 0.374592 │
│ 4 │ 0.367191 │ 0.710325 │
│ 5 │ 0.231486 │ 0.0913907 │
julia> df2 |> @filter(_.a > 0.3) |> DataFrame
ERROR: UndefVarError: DataValueUnwrapper not defined
Stacktrace:
[1] DataFrame(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Float64,Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Array{NamedTuple{(:a, :b),Tuple{Float64,Float64}},1}}},getfield(Main, Symbol("##66#68"))}) at /home/jonas/.julia/packages/DataFrames/z2XOB/src/other/tables.jl:27
[2] |>(::QueryOperators.EnumerableFilter{NamedTuple{(:a, :b),Tuple{Float64,Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Float64,Float64}},Array{NamedTuple{(:a, :b),Tuple{Float64,Float64}},1}}},getfield(Main, Symbol("##66#68"))}, ::Type) at ./operators.jl:813
[3] top-level scope at none:0
@JonasIsensee I think what you are seeing must be a different issue. What versions of packages are you using? I can't reproduce your latest code example.
julia v1.0.3
Query v0.11.0
DataFrames v0.16.0
Tables v0.1.15
TableTraits v0.4.1
QueryOperators v0.7.0
IterableTables v0.10.0
DataValues v0.4.7
There's got to be a better way to find these version numbers. Went through my Manifest by hand.
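For the version list above, the package manager can print the same information without reading the Manifest by hand. A minimal sketch, assuming the standard Pkg tooling that ships with Julia 1.x:

```julia
using Pkg

# Print the packages listed in the active environment's Project.toml,
# together with their installed versions.
Pkg.status()

# In the Pkg REPL (press ] at the julia> prompt), the manifest view also
# lists indirect dependencies such as Tables or DataValues:
#   pkg> status --manifest
```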
Anyway, another ] up seems to have resolved that issue...
Converting all columns to Vector{Any} works now.
Queries to the original DataFrame still cause Julia to hang and steadily increase memory allocation.
Hi, I have this issue in both 1.0.3 and 1.1.0 (I'm using Windows 10). I load a .dta file into a DataFrame and then try filtering it using the @filter(_.) macro, but it hangs indefinitely unless I press Ctrl+C to force an exit. The same df works fine in 0.6.4, but loading the data takes much longer. I know that these are experimental functionalities; are they not available on Julia 1.*? @map seems to work fine, for example (used on data loaded from CSV).