johannspies opened this issue 5 years ago
This was a very interesting bug! It uncovered a whole bunch of problems in the diagnostic display when parsing fails. I think I fixed all of them in https://github.com/JuliaComputing/TextParse.jl/pull/114. With that PR, things still don't work, but one gets a slightly more helpful error message:
```
julia> load("test.csv") |> DataFrame
MethodError: Cannot `convert` an object of type Missing to an object of type TextParse.StrRange
Closest candidates are:
  convert(::Type{S}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} at C:\Users\david\.julia\packages\CategoricalArrays\ucKV2\src\value.jl:91
  convert(::Type{T}, ::T) where T at essentials.jl:154
  TextParse.StrRange(::Any, ::Any) at C:\Users\david\.julia\dev\TextParse\src\util.jl:317
ERROR: CSV parsing error in test.csv at line 29 char 79:
...án,Bhoutan,Бутан,BT,BTN,64,"INR,BTN",BHUTAN,"2,2","Indian Rupee,Ngultrum","356,064",64,مملك...
____________________________________________________^
column 12 is expected to be: TextParse.Field{Union{Missing, Int64},TextParse.NAToken{Union{Missing, Int64},TextParse.Numeric{Int64}}}(<Int64>?, true, true, false)
```
What is happening here is that the type detection algorithm classifies column 12 (and I believe 14 as well) as `Int`, but then line 26 has a string value for that column. We currently cannot recover from a situation where a column was originally classified as `Int` and then turns out to be `String` halfway through the parsing.
Two options to solve this for now: 1) you can manually specify that these columns should be parsed as `String` by doing `load(filename, colparsers=Dict(12=>String, 14=>String))`. Or 2) you can simply increase the number of rows used for column type detection to something larger than 20 (the default): `load(filename, type_detect_rows=30)` should do the trick.
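Putting both workarounds into one runnable sketch (assuming CSVFiles and DataFrames are installed and `test.csv` is in the working directory):

```julia
using CSVFiles, DataFrames

# Workaround 1: force columns 12 and 14 to be parsed as String,
# overriding the automatic type detection for those columns.
df1 = load("test.csv", colparsers=Dict(12=>String, 14=>String)) |> DataFrame

# Workaround 2: sample more rows than the default 20, so the
# detector sees the string values on line 26 before committing.
df2 = load("test.csv", type_detect_rows=30) |> DataFrame
```

Either option avoids the `Int`-then-`String` mismatch; the first is more explicit, the second relies on the offending rows falling inside the sample.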
I do have a plan to make this more robust in general, i.e. a way to recover if the type detection fails (which can always happen, even if one samples more lines), but it will be a while until that is done.
And this file also highlights that our default table printing code messes up the column widths when there is some serious Unicode in the data :)
Hi @johannspies 👋
Coincidentally, I ran into the same issue with the same data set 😅
Since we were looking at the same data set, I thought I'd let you know I just created these two repos based on this data set 😊
A bit of explanation and request for feedback can be found here https://github.com/JuliaFinance/Roadmap/issues/5
Cheers 😊
With CSVFiles
If I truncate the file to the first 26 lines, CSVFiles reads it without a problem. Below are the first 27 lines (which trigger the problem) as an example of the data: