heavyai / heavyai.jl

Julia client for OmniSci GPU-accelerated SQL engine and analytics platform
https://omnisci.github.io/OmniSci.jl/latest/
Other
21 stars 9 forks source link

need floor before Int64 #80

Closed snthot closed 4 years ago

snthot commented 4 years ago

I worked on Julia Version 1.2.0 (2019-08-20) with omnisci/core-os-cpu docker.

I bumped into trouble when applying TColumn on Datetime column. I think we need "floor" before converting into Int64 and it works well.

No obligation, just want to have it worked well :-)

codecov[bot] commented 4 years ago

Codecov Report

Merging #80 into master will increase coverage by 61.7%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #80      +/-   ##
=========================================
+ Coverage    23.7%   85.5%   +61.7%     
=========================================
  Files          11       6       -5     
  Lines        2563     221    -2342     
=========================================
- Hits          609     189     -420     
+ Misses       1954      32    -1922
Impacted Files Coverage Δ
src/constructors.jl 90.2% <100%> (ø) :arrow_up:
src/mapd/MapD.jl
src/mapd/mapd_types.jl
src/completion_hints/completion_hints_types.jl
...rialized_result_set/serialized_result_set_types.jl
src/common/common_types.jl

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ab11d57...0842885. Read the comment docs.

randyzwitch commented 4 years ago

Thanks for checking out the package @snthot! Can you provide a brief example that demonstrates the error you saw? Prior to this fix, the test suite passed with this functionality, so I want to make sure I'm accounting for any edge cases that might be present.

snthot commented 4 years ago

oh sure, here is the case: I have a DateTime column in a DataFrame which filled with now() function. I got error when trying to convert the column to TColumn. Please see below.

julia> using DataFrames

julia> using OmniSci

julia> using Dates

julia> df1 = DataFrame(A=[1,2], B=[DateTime(2013,7,1,12,30,59), DateTime(2013,7,1,12,30,59)])
2×2 DataFrame
│ Row │ A     │ B                   │
│     │ Int64 │ DateTime            │
├─────┼───────┼─────────────────────┤
│ 1   │ 1     │ 2013-07-01T12:30:59 │
│ 2   │ 2     │ 2013-07-01T12:30:59 │

julia> df2 = DataFrame(A=[1,2], B=[DateTime(2013,7,1,12,30,59), now()])
2×2 DataFrame
│ Row │ A     │ B                       │
│     │ Int64 │ DateTime                │
├─────┼───────┼─────────────────────────┤
│ 1   │ 1     │ 2013-07-01T12:30:59     │
│ 2   │ 2     │ 2019-10-30T16:06:57.018 │

julia> TColumn.(eachcol(df1))
2-element Array{TColumn,1}:
 TColumn(OmniSci.TColumnData([1, 2], #undef, #undef, #undef), Bool[false, false])
 TColumn(OmniSci.TColumnData([1372681859, 1372681859], #undef, #undef, #undef), Bool[false, false])

julia> TColumn.(eachcol(df2))
ERROR: InexactError: Int64(1.572451617018e9)
Stacktrace:
 [1] Type at ./float.jl:703 [inlined]
 [2] myInt64 at /home/santi/.julia/packages/OmniSci/OmRhl/src/constructors.jl:38 [inlined]
 [3] _broadcast_getindex_evalf at ./broadcast.jl:578 [inlined]
 [4] _broadcast_getindex at ./broadcast.jl:551 [inlined]
 [5] getindex at ./broadcast.jl:511 [inlined]
 [6] macro expansion at ./broadcast.jl:843 [inlined]
 [7] macro expansion at ./simdloop.jl:73 [inlined]
 [8] copyto! at ./broadcast.jl:842 [inlined]
 [9] copyto! at ./broadcast.jl:797 [inlined]
 [10] copy at ./broadcast.jl:773 [inlined]
 [11] materialize at ./broadcast.jl:753 [inlined]
 [12] TColumn(::Array{DateTime,1}) at /home/santi/.julia/packages/OmniSci/OmRhl/src/constructors.jl:325
 [13] _broadcast_getindex_evalf at ./broadcast.jl:578 [inlined]
 [14] _broadcast_getindex at ./broadcast.jl:551 [inlined]
 [15] getindex at ./broadcast.jl:511 [inlined]
 [16] copyto_nonleaf!(::Array{TColumn,1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},Type{TColumn},Tuple{Base.Broadcast.Extruded{DataFrames.DataFrameColumns{DataFrame,AbstractArray{T,1} where T},Tuple{Bool},Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:928
 [17] copy at ./broadcast.jl:791 [inlined]
 [18] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,Type{TColumn},Tuple{DataFrames.DataFrameColumns{DataFrame,AbstractArray{T,1} where T}}}) at ./broadcast.jl:753
 [19] top-level scope at none:0

I noticed that your test pass because the DateTime data are defined down to second(s) level but the now() function normally go down to millisecond level so I got the trouble. And this prevented me from using 'load_table_binary_columnar' function. Once I fixed this problem by applying 'floor' I found that I can convert high precision datetime to TColumn (with losses) and load the data to OmniSci correctly.

I am not sure whether there is a better way to fix this. But it looks like Omnisci supports high precision timestamp so you may go with Float64 instead of Int64 ? Please advise.

randyzwitch commented 4 years ago

@snthot Thanks for the example, yes that's an oversight on my part. Your solution fixes the timestamp at the seconds level (which is what OmniSci supported when I implemented this), but it's a good question about how to implement higher-resolution timestamps

randyzwitch commented 4 years ago

I confirmed internally that we do want to cast to Int.

Could you make your change mydatetime2unix(x) = datetime2unix(floor(x, Dates.Second)) so that the operation here is explicit in rounding to the second? Also, if you could switch one of the values in the tests to use now() like it caused your error, that will help to not make this mistake in the future.

Separately, I'll open up an issue about supporting high-resolution timestamps.

Thanks!

snthot commented 4 years ago

great! I absolutely agree! thanks!