Closed fuhrysteve closed 9 years ago
Btw @mholt I'd be happy to do the leg work of integrating it and create a pull request (can probably port the unit tests from python stdlib, since it's basically the same code). Just don't want to go to the trouble if you think it's out of scope / better suited for a standalone library / plugin.
I think it could work quite well, though. I envision something along the lines of:
results = Papa.parse(csvString, {header: 'auto'});
if (results.meta.hasHeader) {
// detected headers!
console.log("first value is " + results.data[0]['Column 1']);
} else {
console.log("first value is " + results.data[0][0]);
}
I think there are also some considerable improvements we could make to python stdlib's approach, such as detecting additional types to consider i.e.:
if (!isNaN(Date.parse(sample[y][x]))) {
columnTypes[x] = 'date';
}
Thanks for the details and interest to improve the project! I finally had a chance to look this over, and am interested in it because header row detection is something we used to have to deal with at work and it's something that people occasionally request for this library. In my experience, it's infeasible to do it accurately (at least, accurately enough for our product at work). Header row detection works if you assume something about the rest of the data. For a completely general-purpose library, however, those assumptions can't always be made.
Another problem, which you actually addressed in your first follow-up post, is that the output format changes non-deterministically from the user's point of view, so they need to check some meta value every time they handle the results. In one case they may be handling an array of arrays, and in another, an array of objects.
At work, we dealt with this issue and tried to do header row detection. We eventually just gave up and required all inputs to have a header row. (Detecting which header fields mapped to which inputs in our service was much easier when we knew there would always be a header row.)
I realize that some may prefer the complete autonomy of auto-delimiter detection and header row detection, but where do we draw the line at trying to figure out the user's settings for them? I think I'm drawing the line for Papa Parse at delimiter detection... it complicates the library enough as it is.
So: with that said, I don't want to totally shut down the idea because I'm passionate about developing and using good algorithms to make things easier. This is one such algorithm that I think is a good start but could be improved. With substantial testing (with a wide variety of contrived and real-world inputs), perhaps it will suffice and maybe even be added to the library as a built-in feature. If header detection does work, I need to be convinced of it.
For the next version of Papa Parse, I'm looking at ways to make it more extensible, so you can handle errors diverse ways and also allow for extensions like automatic header row detection. Even though I'm not inclined to build it into the library, I'd be totally happy to see it as some sort of pre/post-processing!
If it's alright, I'm going to close this now, but feel free to discuss further if you'd like.
Sounds good - thanks for taking a look!
@fuhrysteve it seems to be that each row overwrites the values from the previous row:
...
} else {
// fall back to using length
columnTypes[x] = sample[y][x].length;
}
I look that it is like that in the python version as well. The thing that bothers me is that column "x" might contain strings with various lengths, but here it is always saving the last value from "y" row. What am I missing here ?
This is how columnTypes
changes during the sample inspection:
{0: 4, 1: null, 2: null}
{0: 4, 1: 14, 2: null}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 5, 1: 14, 2: true}
{0: 5, 1: 0, 2: true}
{0: 5, 1: 0, 2: true}
{0: 8, 1: 0, 2: true}
{0: 8, 1: 14, 2: true}
{0: 8, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 13, 1: 14, 2: true}
{0: 11, 1: 14, 2: true}
{0: 11, 1: 0, 2: true}
{0: 11, 1: 0, 2: true}
{0: 4, 1: 0, 2: true}
{0: 4, 1: 14, 2: true}
{0: 4, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
{0: 6, 1: 14, 2: true}
{0: 6, 1: 0, 2: true}
{0: 6, 1: 0, 2: true}
{0: 7, 1: 0, 2: true}
{0: 7, 1: 14, 2: true}
{0: 7, 1: 14, 2: true}
At the end, the columnTypes
used is {0: 7, 1: 14, 2: true}
. It seems to me that all the previous string lengths are ignored and discarded.
I recently ported python's
csv.Sniffer.has_header
to javascript. Similar to your use case at SmartyStreets, I'm working on an interface where the user is instructed to paste text from a spreadsheet into a<textarea>
, and from there I try to make sense of it on the client-side (mapping headers if available, etc).Would this sort of functionality be a good fit for PapaParse?
Obviously this approach is not perfect, but I find that it works "good enough" for many general cases -- good enough, anyways, to be included in the python standard library.
python docs https://docs.python.org/3.4/library/csv.html#csv.Sniffer.has_header python implementation (source code) https://hg.python.org/cpython/file/e119343bc3ec/Lib/csv.py#l384
My (tested, but not yet unit tested) implementation: