mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

python's csv.Sniffer.has_header #104

Closed fuhrysteve closed 9 years ago

fuhrysteve commented 9 years ago

I recently ported Python's csv.Sniffer.has_header to JavaScript. Similar to your use case at SmartyStreets, I'm working on an interface where the user is instructed to paste text from a spreadsheet into a <textarea>, and from there I try to make sense of it on the client side (mapping headers if available, etc.).

Would this sort of functionality be a good fit for PapaParse?

Obviously this approach is not perfect, but I find that it works "good enough" for many general cases; good enough, anyway, to be included in the Python standard library.

Python docs: https://docs.python.org/3.4/library/csv.html#csv.Sniffer.has_header
Python implementation (source code): https://hg.python.org/cpython/file/e119343bc3ec/Lib/csv.py#l384

My (tested, but not yet unit tested) implementation:

/**
 * JavaScript port of python stdlib's csv.Sniffer.has_header
 *
 * @author Stephen Fuhry <fuhrysteve@gmail.com>
 * @see {@link https://hg.python.org/cpython/file/e119343bc3ec/Lib/csv.py#l384}
 */
var HasHeader = function(sample) {
    if (sample.length < 1) {
        return false;
    }
    var header = sample[0];
    var num_columns = header.length;
    var columnTypes = {};
    for (var i=0; i<num_columns; i++) {
        columnTypes[i] = null;
    }
    // cap the number of data rows inspected (the Python version checks about 20)
    var times = Math.min(21, sample.length);
    for (var y=1; y < times; y++) {
        var colLength = sample[y].length;
        if (colLength !== num_columns) continue;

        for (var x=0; x < colLength; x++) {
            // "is it a number?" check
            if ((sample[y][x] - parseFloat( sample[y][x] ) + 1) >= 0) {
                if (columnTypes[x] !== true) {
                    if (columnTypes[x] === null) {
                        // column is numeric!
                        columnTypes[x] = true;
                    } else {
                        // column is inconsistent.. remove it from consideration
                        delete columnTypes[x];
                    }
                }
            } else {
                // fall back to using length
                columnTypes[x] = sample[y][x].length;
            }
        }
    }
    var hasHeader = 0;
    // Time to "vote" on what's a header!
    for (var t in columnTypes) {
        // if it's a number, it's a string length
        if (typeof columnTypes[t] === 'number') {
            if (header[t].length === columnTypes[t]) {
                hasHeader--;
            } else {
                hasHeader++;
            }
        // true means the column type is numeric
        } else if (columnTypes[t] === true) {
            if ((header[t] - parseFloat( header[t] ) + 1) >= 0) {
                hasHeader--;
            } else {
                hasHeader++;
            }
        }
    }
    return hasHeader > 0;
};

// usage:
var sample = [
    ['name','phone','age'],
    ['jack','(440) 234-2222','29'],
    ['jill','(216) 234-3235','42'],
    ['frank','','52'],
    ['bob hope','(216) 234-3235','52'],
    ['ronald reagan','(216) 234-3235','7'],
    ['ronald paul','','41'],
    ['john','(216) 234-3235','24'],
    ['william','(216) 234-3235','15'],
    ['venice','','11'],
    ['someone','(216) 234-3235','87']
];

HasHeader(sample); // => true

// remove first row
sample.shift();

HasHeader(sample); // => false
fuhrysteve commented 9 years ago

Btw @mholt I'd be happy to do the legwork of integrating it and create a pull request (I can probably port the unit tests from the Python stdlib, since it's basically the same code). I just don't want to go to the trouble if you think it's out of scope / better suited for a standalone library / plugin.

I think it could work quite well, though. I envision something along the lines of:

var results = Papa.parse(csvString, {header: 'auto'});
if (results.meta.hasHeader) {
    // detected headers!
    console.log("first value is " + results.data[0]['Column 1']);
} else {
    console.log("first value is " + results.data[0][0]);
}
fuhrysteve commented 9 years ago

I think there are also some considerable improvements we could make to the Python stdlib's approach, such as detecting additional types to consider, e.g.:

if (!isNaN(Date.parse(sample[y][x]))) {
    columnTypes[x] = 'date';
}
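
Roughly sketched (the helper name is just for illustration), the extra types could be folded into the same per-column consistency idea by classifying each cell first:

// sketch only: classify a single cell, reusing the numeric trick above
// and falling back to string length as before
function cellType(value) {
    if ((value - parseFloat(value) + 1) >= 0) return 'number';
    if (!isNaN(Date.parse(value))) return 'date';
    return value.length;
}

The main loop could then compare cellType(sample[y][x]) against the stored value for that column and drop the column on a mismatch, the same way the numeric branch already does.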
mholt commented 9 years ago

Thanks for the details and the interest in improving the project! I finally had a chance to look this over, and I'm interested in it because header row detection is something we used to have to deal with at work, and it's something that people occasionally request for this library. In my experience, though, it's infeasible to do accurately (at least, accurately enough for our product at work). Header row detection works if you can assume something about the rest of the data; for a completely general-purpose library, those assumptions can't always be made.

Another problem, which you actually addressed in your first follow-up post, is that the output format changes non-deterministically from the user's point of view, so they need to check some meta value every time they handle the results. In one case they may be handling an array of arrays, and in another, an array of objects.

At work, we dealt with this issue and tried to do header row detection. We eventually just gave up and required all inputs to have a header row. (Detecting which header fields mapped to which inputs in our service was much easier when we knew there would always be a header row.)

I realize that some may prefer the complete autonomy of auto-delimiter detection and header row detection, but where do we draw the line at trying to figure out the user's settings for them? I think I'm drawing the line for Papa Parse at delimiter detection... it complicates the library enough as it is.

So: with that said, I don't want to totally shut down the idea, because I'm passionate about developing and using good algorithms to make things easier. This is one such algorithm that I think is a good start but could be improved. With substantial testing (against a wide variety of contrived and real-world inputs), perhaps it will suffice and maybe even be added to the library as a built-in feature. But if header detection is going to go in, I'll need to be convinced that it actually works.

For the next version of Papa Parse, I'm looking at ways to make it more extensible, so you can handle errors diverse ways and also allow for extensions like automatic header row detection. Even though I'm not inclined to build it into the library, I'd be totally happy to see it as some sort of pre/post-processing!
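
Something along these lines, purely as a sketch (the wrapper name is made up and the double parse is only to keep it short), could live entirely outside the library:

// sketch of header detection as a pre-processing step around Papa.parse
function parseWithAutoHeader(csvString, config) {
    config = config || {};
    // first pass: array-of-arrays output to feed the HasHeader heuristic
    var probe = Papa.parse(csvString, { delimiter: config.delimiter });
    var detected = HasHeader(probe.data);
    // second pass with the detected setting
    var results = Papa.parse(csvString, {
        delimiter: config.delimiter,
        header: detected
    });
    results.meta.hasHeader = detected; // mimic the proposed meta flag
    return results;
}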

mholt commented 9 years ago

If it's alright, I'm going to close this now, but feel free to discuss further if you'd like.

fuhrysteve commented 9 years ago

Sounds good - thanks for taking a look!

kktsvetkov commented 4 years ago

@fuhrysteve it seems that each row overwrites the values from the previous row:

...
            } else {
                // fall back to using length
                columnTypes[x] = sample[y][x].length;
            }

I looked and it seems to be like that in the Python version as well. The thing that bothers me is that column "x" might contain strings with various lengths, but here it always saves the last value from row "y". What am I missing here?

This is how columnTypes changes during the sample inspection:

{0: 4, 1: null, 2: null}
{0: 4, 1: 14, 2: null}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 5, 1: 14, 2: true}
{0: 5, 1: 0, 2: true}
{0: 5, 1: 0, 2: true}
{0: 8, 1: 0, 2: true}
{0: 8, 1: 14, 2: true}
{0: 8, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 11, 1: 14, 2: true}
{0: 11, 1: 0, 2: true}
{0: 11, 1: 0, 2: true}
{0: 4, 1: 0, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 6, 1: 14, 2: true}
{0: 6, 1: 0, 2: true}
{0: 6, 1: 0, 2: true}
{0: 7, 1: 0, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}

At the end, the columnTypes used is {0: 7, 1: 14, 2: true}. It seems to me that all the previous string lengths are ignored and discarded.
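
One way to make the lengths actually constrain the result, instead of only the last row's value counting, would be to run the length through the same "record it once, otherwise drop the column" check that the numeric branch uses; a sketch of that branch only:

            } else {
                // fall back to using length, but treat it like any other type:
                // record it once, and drop the column if a different length shows up later
                var len = sample[y][x].length;
                if (columnTypes[x] === null) {
                    columnTypes[x] = len;
                } else if (columnTypes[x] !== len) {
                    delete columnTypes[x];
                }
            }

With that change, only columns whose values keep a consistent length (or stay numeric) would take part in the vote.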