Closed tbeason closed 6 years ago
Strangely enough, SAS7BDAT files only have two types, Character or Numeric. I can't tell if the column is `Int` vs `Float`. So we are facing two options, treating them as either `NaN` or `missing`.
It also appears that SAS does not distinguish `NaN` from `missing` data. See the last page of https://cran.r-project.org/web/packages/sas7bdat/vignettes/sas7bdat.pdf
Whether it's `missing` or `NaN`, they're going to choke on any math calculations unless you skip them. You may wonder why `Date` and `DateTime` columns get special treatment. It's because `NaN` isn't a valid value for those columns.
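To make the "choke on math" point concrete, here is a minimal sketch, assuming Julia 1.x where `missing` lives in Base (on v0.6 it comes from the Missings package). The variable names are only for illustration:

```julia
# Two toy columns: one marks the gap with NaN, the other with missing.
with_nan = [1.0, NaN, 3.0]
with_missing = [1.0, missing, 3.0]

# Both markers propagate through a plain reduction.
sum(with_nan)        # result is NaN
sum(with_missing)    # result is missing

# Either way the caller has to skip the gaps explicitly.
nan_total = sum(filter(!isnan, with_nan))
missing_total = sum(skipmissing(with_missing))
```

Both approaches need an explicit skip step, so the choice between them is mostly about consistency and column element types rather than convenience.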
I do agree that `missing` is more consistent. However, I was told fairly clearly that `missing` isn't performant and it was even suggested to be removed until v0.7 (see #9).
You're right - I don't know how I had forgotten this. Most of what SAS does is just formatting. That does complicate things (perhaps considerably) I guess.
My general comment still stands. I'm not sure how slow `missing` is on v0.6.2 when compared to v0.7, but I also don't think that is too important. Eventually, we will be beyond both of those versions.
Perhaps my comment could be made a bit more broad: Is there a way to infer the column types correctly (according to Julia's type tree)? I suppose Character to `String` is easy. Suppose the SAS file tells us it is a numeric column. Can we implement some logic that dynamically detects columns of `Bool`, `Int`, `Float`?
The `missing` vs `NaN` question then becomes a moot point because `NaN` is exclusive to `Float`.
> Perhaps my comment could be made a bit more broad: Is there a way to infer the column types correctly (according to Julia's type tree)? I suppose Character to `String` is easy. Suppose the SAS file tells us it is a numeric column. Can we implement some logic that dynamically detects columns of `Bool`, `Int`, `Float`?
All numbers are stored in IEEE floating point format and hence they must be read in as `Float64`. After that, we could play smart and try to convert them into `Int`, falling back to `Float64` if the conversion throws an `InexactError`. It sounds like a nice feature (as I was annoyed by it as well) but I think it should be a user option - call it something like `auto_convert_integers`.
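A rough sketch of that idea, assuming the column has already been read as a `Vector{Float64}`. The helper name `try_convert_integers` is hypothetical and not part of the package; it only illustrates the "convert unless `InexactError`" logic:

```julia
# Hypothetical helper: try to narrow a Float64 column to Int.
# Returns the original column untouched if any value is not an exact integer
# (fractional values, NaN, and out-of-range values all throw InexactError).
function try_convert_integers(col::Vector{Float64})
    try
        return convert(Vector{Int}, col)
    catch err
        # Only swallow the expected failure; anything else is a real error.
        err isa InexactError || rethrow()
        return col
    end
end

ints = try_convert_integers([1.0, 2.0, 3.0])    # becomes a Vector{Int}
floats = try_convert_integers([1.5, 2.0])       # stays a Vector{Float64}
gaps = try_convert_integers([1.0, NaN])         # NaN blocks the conversion
```

Note that a single `NaN` keeps the whole column as `Float64`, which is exactly the case the original report complains about.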
Turning it into a `Bool` would be tricky and perhaps even undesirable -- just because a column happens to have zeros and ones only doesn't mean it's a boolean column.
SAS does not distinguish empty string from missing data. The current convention is to keep them as empty strings. It's probably less controversial here since nobody really cares about the difference, do you?
> All numbers are stored in IEEE floating point format and hence they must be read in as `Float64`. After that, we could play smart and try to convert them into `Int`, falling back to `Float64` if the conversion throws an `InexactError`. It sounds like a nice feature (as I was annoyed by it as well) but I think it should be a user option - call it something like `auto_convert_integers`.
I think having it as an option via keyword arg would be fine. Then if it encounters problems on a file for whatever reason (obviously, we'd like to limit these cases to be rare) it can be turned off.
> SAS does not distinguish empty string from missing data. The current convention is to keep them as empty strings. It's probably less controversial here since nobody really cares about the difference, do you?
Absolutely not haha. I try to avoid string columns like the plague unless I created them myself. Empty string is probably preferred.
Closing this issue as #37 will address the gap.
As it stands, the only columns that accept `missing` values are date and datetime columns. Any other columns with a missing value (except strings, I guess) get parsed as `Float`. For example, if a column of `Int` has a missing element right now, it gets parsed as `NaN` and so the column is forced to be `Float` since there is no native `NaN` integer element. The fact that some missing values get parsed as `NaN` (incorrectly, I would argue) but others get parsed as `missing` presents an inconsistency.
I believe SAS does store information about the element type of the column. Even if it does not, simple csv/text parsing logic could be used to infer the type of the element as best we can.
Therefore, I propose that we begin to parse all missing elements as `missing` to be consistent within the package, with how they are treated in SAS, and with the majority of the data ecosystem in Julia.
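For illustration only, the proposal could look something like the following on a numeric column that has already been read as `Float64`. This assumes Julia 1.x, and `nan_to_missing` is a hypothetical name, not an existing function in the package:

```julia
# Hypothetical helper: represent gaps in a numeric column as `missing`
# instead of NaN. The typed comprehension fixes the element type even when
# the column happens to contain no gaps.
function nan_to_missing(col::Vector{Float64})
    return Union{Float64, Missing}[isnan(x) ? missing : x for x in col]
end

col = nan_to_missing([1.0, NaN, 3.0])
total = sum(skipmissing(col))   # gaps are skipped explicitly
```

One caveat with this sketch: it cannot tell a SAS missing value from a genuine `NaN`, since both arrive as `NaN` once the column is read as `Float64`.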