Is there an option to tell the Miso Dataset CSV parser to look at all the values before deciding that a column should be of a type other than String?
In practice the CSV parser is brittle for general datasets: it does not look far enough down the rows to detect the column types, and it is then unforgiving about type mismatches, failing with a hard error.
I keep running into errors of the form "Uncaught incorrect value 'X' of type string passed to column 'Y' with type number". This happens in cases where:
- A column is detected as boolean and then does not allow a value such as `Y`.
- A column has ICD9 codes that are often numeric (e.g. 40100), and fails on the first V code (e.g. V0100).
- A column has postal codes, which look like numbers (e.g. 02138) but are really strings, and fails on the first Canadian code (e.g. K1A0B1).
- A column has medical ID numbers, which look like numbers for most physicians (e.g. 00002348938) but include alphanumerics for nurses and other HCPs.
Many stats packages look at the first 100 rows by default, and have an option to scan more or even all rows before assessing column type.
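To make the failure mode concrete, here is a minimal sketch in plain JavaScript. It is not Miso's actual code; `looksNumeric`, `inferFromPrefix`, and `firstMismatch` are hypothetical names, and the numeric test is a simplifying assumption. It guesses a type from the first few values only and then finds the first value that contradicts the guess.

```javascript
// Hypothetical illustration, not Miso's actual implementation.
// A value "looks numeric" if it is an optional minus sign, digits,
// and an optional decimal part.
function looksNumeric(value) {
  return /^-?\d+(\.\d+)?$/.test(String(value).trim());
}

// Guess a column type from only the first `sampleSize` values.
function inferFromPrefix(values, sampleSize) {
  var sample = values.slice(0, sampleSize);
  return sample.every(looksNumeric) ? "number" : "string";
}

// Return the index of the first value that contradicts the guessed type,
// or -1 if every value is compatible with it.
function firstMismatch(values, type) {
  if (type !== "number") { return -1; }
  for (var i = 0; i < values.length; i++) {
    if (!looksNumeric(values[i])) { return i; }
  }
  return -1;
}

// Five US postal codes followed by a Canadian one.
var postalCodes = ["02138", "10001", "94105", "60601", "73301", "K1A0B1"];
var guessed = inferFromPrefix(postalCodes, 5);
var badIndex = firstMismatch(postalCodes, guessed);
console.log(guessed, badIndex);
```

With a sample of five, the column is guessed as a number and the sixth value is the one that triggers the mismatch. Sampling the whole array instead yields a string column.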
* Update *
I see builder.js line 23 has the code, so I just need to find a way to parameterize the 5:
var type = _.inject(data.slice(0, 5), function(memo, value) {
I created a quick patch to always scan all the values. Since a type mismatch is a fatal error, it seems more appropriate to make a complete scan the default and a partial scan an option:
https://github.com/gradualstudent/dataset/tree/master/dist
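For anyone who wants the idea without reading the patch, the approach can be sketched roughly as follows. This is a standalone illustration in plain JavaScript, not the patched `builder.js`. The names `detectValueType` and `inferColumnType` and the `sampleSize` option are assumptions made for this example, not part of Miso's API. By default every value is scanned, and a partial scan is opt-in.

```javascript
// Hypothetical sketch; names and the `sampleSize` option are assumptions,
// not part of Miso's API.

// Classify a single raw CSV value. Missing values give no evidence.
function detectValueType(value) {
  if (value === null || value === undefined || value === "") { return null; }
  var s = String(value).trim();
  if (/^(true|false)$/i.test(s)) { return "boolean"; }
  if (/^-?\d+(\.\d+)?$/.test(s)) { return "number"; }
  return "string";
}

// Infer a column type. Scans all values unless options.sampleSize is given.
function inferColumnType(values, options) {
  var sampleSize = options && options.sampleSize;
  var sample = (typeof sampleSize === "number")
    ? values.slice(0, sampleSize)
    : values;
  var type = null;
  for (var i = 0; i < sample.length; i++) {
    var t = detectValueType(sample[i]);
    if (t === null) { continue; }             // skip missing values
    if (type === null) { type = t; }          // first piece of evidence
    else if (type !== t) { return "string"; } // mixed evidence: fall back
  }
  return type || "string";                    // no evidence at all: string
}

// ICD9-style codes: five numeric-looking values, then a V code.
var icd9 = ["40100", "41401", "25000", "4280", "4019", "V0100"];
console.log(inferColumnType(icd9));                    // full scan
console.log(inferColumnType(icd9, { sampleSize: 5 })); // partial scan
```

In this sketch a full scan only helps when a non-numeric value appears somewhere in the column. A column made up entirely of digit strings with leading zeros would still be classified as a number, so such columns would need their type set explicitly.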