Column types consistency issue - missing vs NaN

tk3369 / SASLib.jl

Julia library for reading SAS7BDAT data sets

Other

34 stars 7 forks source link

Column types consistency issue - missing vs NaN #32

Closed tbeason closed 6 years ago

tbeason commented 6 years ago

As it stands, the only columns that accept missing values are date and datetime columns. Any other columns with a missing value (except strings I guess) get parsed as Float. For example, if a column of Int has a missing element right now, it gets parsed as NaN and so the column is forced to be Float since there is no native NaN integer element. The fact that some missing values get parsed as NaN (incorrectly, I would argue) but others get parsed as missing presents an inconsistency.

I believe SAS does store information about the element type of the column. Even if it does not, simple csv/text parsing logic could be used to infer the type of the element as best we can.

Therefore, I propose that we begin to parse all missing elements as missing to be consistent within the package, with how they are treated in SAS, and with the majority of the data ecosystem in Julia.

tk3369 commented 6 years ago

Strangely enough, SAS7BDAT files only have two types, Character or Numeric. I can't tell if the column is Int vs Float. So we are facing two options, treating them as either NaN or missing.

It also appears that SAS does not distinguish NaN from missing data. See last page from https://cran.r-project.org/web/packages/sas7bdat/vignettes/sas7bdat.pdf

Whether it's missing or NaN, they're going to choke on any math calculations unless you skip them. You may wonder why Date and DateTime columns get special treatment. It's because NaN isn't a valid value for those columns.

I do agree that missing is more consistent. However, I was told fairly clearly that missing isn't performant and it was even suggested to be removed until v0.7 (see #9).

tbeason commented 6 years ago

You're right - I don't know how I had forgotten this. Most of what SAS does is just formatting. That does complicate things (perhaps considerably) I guess.

My general comment still stands. I'm not sure how slow missing is on v0.6.2 when compared to v0.7, but I also don't think that is too important. Eventually, we will be beyond both of those versions.

Perhaps my comment could be made a bit more broad: Is there a way to infer the column types correctly (according to Julia's type tree)? I suppose Character to String is easy. Suppose, the SAS file tells us it is a numeric column. Can we implement some logic that dynamically detects columns of Bool, Int, Float?

The missing vs NaN then becomes a moot point because NaN is exclusive to Float.

tk3369 commented 6 years ago

Perhaps my comment could be made a bit more broad: Is there a way to infer the column types correctly (according to Julia's type tree)? I suppose Character to String is easy. Suppose, the SAS file tells us it is a numeric column. Can we implement some logic that dynamically detects columns of Bool, Int, Float?

All numbers are stored in IEEE floating point format and hence they must be read in as Float64. After that, we could play smart and try to convert them into Int unless it encounters an InexactError. It sounds like a nice feature (as I was annoyed by it as well) but I think it may warrant to be a user option - call it something like auto_convert_integers.

Turning it into a Bool would be tricky and perhaps even undesirable -- just because a column happens to have zeros and ones only doesn't mean it's a boolean column.

SAS does not distinguish empty string from missing data. The current convention is to keep them as empty strings. It's probably less controversial here since nobody really cares about the difference, do you?

tbeason commented 6 years ago

All numbers are stored in IEEE floating point format and hence they must be read in as Float64. After that, we could play smart and try to convert them into Int unless it encounters an InexactError. It sounds like a nice feature (as I was annoyed by it as well) but I think it may warrant to be a user option - call it something like auto_convert_integers.

I think having it as an option via keyword arg would be fine. Then if it encounters problems on a file for whatever reason (obviously, we'd like to limit these cases to be rare) it can be turned off.

SAS does not distinguish empty string from missing data. The current convention is to keep them as empty strings. It's probably less controversial here since nobody really cares about the difference, do you?

Absolutely not haha. I try to avoid string columns like the plague unless I created them myself. Empty string is probably preferred.

tk3369 commented 6 years ago

Closing this issue as #37 will address the gap.