tk3369 / SASLib.jl

Julia library for reading SAS7BDAT data sets
Other
34 stars 7 forks source link

Ability to convert column to specific types given by the user #42

Closed tk3369 closed 6 years ago

tk3369 commented 6 years ago

for issue #37

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.1%) to 95.53% when pulling 563b321853bf185bc00d5b77aa598981b4d5f633 on tk/user-specified-column-type into e1e39f07a9ef09bbca35cab7557dc3bbca7aa57b on master.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-1.3%) to 94.08% when pulling 84d7fa22fd6e78086a413c91f5482fbf2eaae48b on tk/user-specified-column-type into e1e39f07a9ef09bbca35cab7557dc3bbca7aa57b on master.

codecov[bot] commented 6 years ago

Codecov Report

Merging #42 into master will increase coverage by 0.1%. The diff coverage is 97.5%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master      #42     +/-   ##
=========================================
+ Coverage   95.42%   95.53%   +0.1%     
=========================================
  Files           7        8      +1     
  Lines         743      783     +40     
=========================================
+ Hits          709      748     +39     
- Misses         34       35      +1
Impacted Files Coverage Δ
src/Types.jl 100% <ø> (ø) :arrow_up:
src/CIDict.jl 100% <100%> (ø)
src/SASLib.jl 94.69% <91.66%> (-0.06%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e1e39f0...563b321. Read the comment docs.

tk3369 commented 6 years ago

@tbeason want to take a look before I merge?

tbeason commented 6 years ago

I can check it out tomorrow. This will be a super helpful feature!

On Sun, Apr 8, 2018, 12:19 PM Tom Kwong notifications@github.com wrote:

@tbeason https://github.com/tbeason want to take a look before I merge?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tk3369/SASLib.jl/pull/42#issuecomment-379575104, or mute the thread https://github.com/notifications/unsubscribe-auth/AKQRwZ5E3eVLDE-pJAyfzldg2l-rrNLkks5tmmLLgaJpZM4TCtIa .

tbeason commented 6 years ago

So it works from what I can tell (if the columns do not have missing values). I got warnings when I supplied conversions that I expected to fail, so that's good.

It does fail for Union types, even if the true type is a member of the union. For example, if the column truly contains only Int, I would have guessed that requesting Union{Int,Float64} would work, but it does not.

tk3369 commented 6 years ago

Thanks. Is this what you're referring to? Looks like broadcast fails to keep the Union.

julia> convert(Union{Int,Float64}, 10.0)
10.0

julia> typeof(convert(Union{Int,Float64}, 10.0))
Float64

julia> convert.(Union{Int,Float64}, [10.0, 20.0])
2-element Array{Float64,1}:
 10.0
 20.0
tk3369 commented 6 years ago

Never mind. I replicated the problem.

julia> rs = readsas("productsales.sas7bdat", column_types=Dict(:ACTUAL=>Union{Int,Float64}))
ERROR: MethodError: Cannot `convert` an object of type Type{Union{Float64, Int64}} to an object of type DataType
This may have arisen from a call to the constructor DataType(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] convert(::Type{Dict{Symbol,DataType}}, ::Dict{Symbol,Union}) at ./dict.jl:201
 [2] #readsas#16(::String, ::Bool, ::Array{Any,1}, ::Array{Any,1}, ::Dict{Any,Any}, ::Dict{Any,Any}, ::Dict{Symbol,Union}, ::Int64, ::SASLib.#readsas, ::String) at /Users/tomkwong/iCloudDocs/Programming/Julia/SASLib/src/SASLib.jl:162
 [3] (::SASLib.#kw##readsas)(::Array{Any,1}, ::SASLib.#readsas, ::String) at ./<missing>:0

I just realized that Union is not a DataType but a Type. I will fix that later this week.

julia> subtypetree(Type)
Type
    Core.TypeofBottom
    DataType
    Union
    UnionAll
tk3369 commented 6 years ago

Fixed.

julia> rs = readsas("productsales.sas7bdat", column_types=Dict(:YEAR=>Union{Int,Missing}))
Read productsales.sas7bdat with size 1440 x 10 in 0.40709 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993, 1993-05-01
⋮

julia> typeof(rs[:YEAR])
Array{Union{Int64, Missings.Missing},1}
tbeason commented 6 years ago

I'll check this again in the next couple of days.

tk3369 commented 6 years ago

@tbeason ping

tbeason commented 6 years ago

Seems to work fine now. Sorry for the delay. I'd say this is probably good to go.

I do wonder now how far of a stretch it really is to just implement the NaN -> missing conversion. Would be a couple of lines of additional code in the column type converter function. I'm not doing much data work at moment, but I'll probably end up adding this functionality locally at some point. Then I'll make a PR and you can look it over and see if that's something you want.

tk3369 commented 6 years ago

I do wonder now how far of a stretch it really is to just implement the NaN -> missing conversion. Would be a couple of lines of additional code in the column type converter function. I'm not doing much data work at moment, but I'll probably end up adding this functionality locally at some point. Then I'll make a PR and you can look it over and see if that's something you want.

Thanks for spending time testing. Your contribution is very welcome. To track the enhancement request, I've created a new issue #43.