misoproject / dataset

JavaScript library that makes managing the data behind client-side visualisations easy
http://misoproject.com
GNU General Public License v2.0
1.18k stars 99 forks source link

min() and max() can't cope with NaN values? #120

Closed alanomaly closed 12 years ago

alanomaly commented 12 years ago

It seems like, if you use min() and max() on a number column that has any non-numerical values, they freak out, silently fail and give strange results.

It's neither unusual nor wrong for people compiling data sets to have the occasional non-numeric value in a number column. As well as commonplace formatting accidents like Excel adding an empty row to exported csv files, there are many cases where the occasional text string in a number column makes sense (e.g. "N/A" or "Unknown" or "Not significant" or "P>0.05" or "To be published 11/08/12" or "Redacted" or "Complete" or "Cancelled" or "*"...).

In my case I'm getting '0' as the max value in a column of numbers in a .csv file where the data maintainers (presumably exporting from Excel) have left a blank line at the bottom resulting in a 'NaN' value in the daaset column. If I make a local copy of the csv and remove the blank line in a text editor, max() behaves and gives the real max value - but this isn't a practical long-term workflow. If there's a string in there (including whitespace like ", ," instead of ",," in a csv), the whole thing crashes with an uncaught exception.

If the purpose of Dataset is to deal with the messy world of published data sets, surely min() and max() shouldn't just fail when a NaN value is present, and should instead give the actual min() and max() of the available numeric values? Right now other than the optional iterator I'm not sure what dataset.max(column.name); does that couldn't be done with Math.max.apply(Math, column.data);

(for anyone else experiencing this, a simplistic short-term workaround for the silently failing on blank rows part is it use something like this:

columnValues = getColumnValues( column );
max = Math.max.apply(Math, columnValues);

function getColumnValues( column ) {
    columnValues = []
    for (var i = column.data.length - 1; i >= 0; i--) {
        if(!isNaN(column.data[i])) {
            columnValues.push(column.data[i]); 
        }
    };
    return columnValues;
}
iros commented 12 years ago

Could you let me know which version of Dataset you are using? 0.1.0 or 0.1.1?

iros commented 12 years ago

Also wondering, in the case of a column containing data like: [0,2,3,4,5 "greater than 1000"] the max is actually the last value, but I have no way of computationally deriving that. If it says "less than 1000" that's ambiguous from that point. Would you be ok with just nulls/empty strings ignored? I could ignore all strings, but then the situation above could yield the wrong result.