mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

Rename headers from external data file #196

Closed mcshaman closed 9 years ago

mcshaman commented 9 years ago

It would be very useful to be able to change the names of headers before processing a data file. We are often combining CSVs from different stakeholders who use slightly different labels for the same data. Combining them would be as easy as an Array.concat() if all the property names in the data property were the same.

I can loop through every single item in a data set after it has been processed, adding and deleting properties to get this result... but this seems very inefficient, and I assume it wouldn't be too hard to do within PapaParse.
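
Roughly what that post-parse loop looks like (just a sketch; the csvText variable and the header names here are made up for illustration):

var results = Papa.parse( csvText, { header: true } );

results.data.forEach( function( row ) {
    // move one stakeholder's label onto the property name we actually want
    if ( 'Customer Id' in row ) {
        row['customer_id'] = row['Customer Id'];
        delete row['Customer Id'];
    }
} );

// only now can results.data be concatenated with the other files' data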

bluej100 commented 9 years ago

Perhaps you could use this? https://github.com/mholt/PapaParse/pull/192

mcshaman commented 9 years ago

Could do... How do I test it?

mholt commented 9 years ago

If you're okay with just copy+paste, here is a link to the full file with those changes: https://raw.githubusercontent.com/bluej100/PapaParse/beforeFirstChunk/papaparse.js

I'll get around to these PRs as soon as school ends this month...

mcshaman commented 9 years ago

The callback does not seem to work for me. As soon as I include it, even if it has no logic in it, my results in the complete callback end up empty.

Papa.parse( './mydoc.csv', {
    delimiter: ',',
    download: true,
    beforeFirstChunk: function() {
        // no code needed to break this example
    },
    complete: function( result ) {
        console.log( result );
    }
} );

bluej100 commented 9 years ago

You need to return the modified first chunk. Sorry for the confusion.

mcshaman commented 9 years ago

Don't apologise, I should have tried that :)

So yeah the following works:

Papa.parse( './mydoc.csv', {
    delimiter: ',',
    download: true,
    beforeFirstChunk: function( chunk ) {
        var rows = chunk.split( /\r\n|\r|\n/ );
        var headings = rows[0].split( ',' );
        headings[0] = 'newHeading';
        rows[0] = headings.join();
        return rows.join( '\n' );
    },
    complete: function( result ) {
        console.log( result );
    }
} );

bluej100 commented 9 years ago

Glad you were able to get it to work. It's a little convoluted, but I'm not sure that your use case will be common.

I expect it would be a bit more performant if you did this, fyi:

    var index = chunk.match( /\r\n|\r|\n/ ).index;
    var headings = chunk.substr(0, index).split( ',' );
    headings[0] = 'newHeading';
    return headings.join() + chunk.substr(index);

mcshaman commented 9 years ago

Thanks for your super-efficient code @bluej100

@mholt, unfortunately this version of PapaParse has started introducing server errors when trying to load external CSV files. I have not yet had a chance to gather much info. I was using it in a production environment, and as soon as it started playing up I rolled back to the official distribution, which fixed the issue.

Would you like to try and recreate the issue?

mholt commented 9 years ago

@mcshaman Now that I have a little time, YES, gladly. Also, I did a release this morning with all the changes from the last 4 months - but it's still just a patch release. See if it fixes your problem (without jeopardizing your production server!) and let me know.

Edit: Also, the beforeFirstChunk callback change was just merged into master (not included in the release from this morning though), just so you know.

sbrodehl commented 9 years ago

Not sure if it is related to the recent changes, but I'm getting an error when I try to use the beforeFirstChunk feature with the latest v4.1.1.

At first I tried to return the given chunk, just to check if the method works:

Papa.parse(file, {
    delimiter: ',',
    download: true,
    worker: true,
    fastMode: true,
    beforeFirstChunk: function(chunk) {
        return chunk;
    },
    chunk: function(chunk) {
        // ...
    },
    complete: function() {
        // ...
    }
});

But that doesn't work. I tried the example above (but without changing the chunk)

beforeFirstChunk: function(chunk) {
    var index = chunk.match( /\r\n|\r|\n/ ).index;
    var headings = chunk.substr(0, index).split( ',' );
    return headings.join() + chunk.substr(index);
}

but got the same error:

DataCloneError: The object could not be cloned.

in papaparse.js:192:0, that's in

function CsvToJson(_input, _config) {
    ...
}

Without beforeFirstChunk everything works fine, any ideas?

mholt commented 9 years ago

@sbrodehl You're using workers. I'm guessing that we overlooked something and it's trying to copy the beforeFirstChunk callback to the worker. Functions can't be copied to workers because workers operate in a completely different scope.

Hmm... the worker will need to copy the first chunk to the main thread, have it processed, then have it sent back to the worker.
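
For the record, the error itself is just the browser refusing to structured-clone a function when the config is posted to the worker. A minimal sketch (not PapaParse code, and worker.js is a placeholder):

var w = new Worker( 'worker.js' ); // placeholder worker script

try {
    // functions can't be structured-cloned, so this postMessage throws immediately
    w.postMessage( { beforeFirstChunk: function( chunk ) { return chunk; } } );
} catch ( e ) {
    console.log( e.name ); // "DataCloneError"
}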

bluej100 commented 9 years ago

Ugh, sorry for that oversight.

sbrodehl commented 9 years ago

Well, with workers disabled I can add the beforeFirstChunk method, but I get no response from it. In fact the method isn't called at all.

beforeFirstChunk: function(chunk) {
    alert("beforeFirstChunk");
    console.log("beforeFirstChunk");
    return chunk;
},

Same with v4.1.0 by the way.

mholt commented 9 years ago

@bluej100 It's okay, I missed it too in my hurry to start catching up.

And actually, maybe another release is needed since 4.1.1 doesn't include the beforeFirstChunk feature - but the latest at HEAD does. I don't want to do a full 4.2 release until most of the remaining issues and PRs are closed, but if you try with what is currently on master it should work.

bluej100 commented 9 years ago

For what it's worth, @mcshaman, I got better performance when I switched from PapaParse's worker feature to my own worker anyway, since I only needed summarized results. It saved a lot of message-passing overhead.
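
In case it helps, the rough shape of that approach (file names and the summary are made up; the worker imports papaparse.js itself, parses, and posts back only a small summary so the full row data never crosses the worker boundary):

// my-worker.js (hypothetical file name)
importScripts( 'papaparse.js' );

onmessage = function( e ) {
    var results = Papa.parse( e.data, { header: true, skipEmptyLines: true } );
    // send back only an aggregate, not every parsed row
    postMessage( { rows: results.data.length, errors: results.errors.length } );
};

// main page
var worker = new Worker( 'my-worker.js' );
worker.onmessage = function( e ) {
    console.log( 'summary:', e.data );
};
worker.postMessage( csvText ); // csvText: the raw CSV string, fetched however you like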

sbrodehl commented 9 years ago

@mholt Latest papaparse.js from HEAD works like a charm! But a minified version is not available. Thanks!

mholt commented 9 years ago

That's true, I only build minified versions upon a release. But you can use jscompress.com in the meantime.

mcshaman commented 9 years ago

Hey @mholt , sorry for the delayed response. I just started using 4.1.1 and am still getting errors.

In Chrome the error is:

GET http://myserver.local/path/to/file.csv 400 (Bad Request)    papaparse.min.js:6
_readChunk    papaparse.min.js:6
_nextChunk    papaparse.min.js:6
stream    papaparse.min.js:6
t    papaparse.min.js:6
parseData    activity.html:126
(anonymous function)    activity.html:260
c    jquery-1.9.1.min.js:3
p.fireWith    jquery-1.9.1.min.js:3
b.extend.ready    jquery-1.9.1.min.js:3
H    jquery-1.9.1.min.js:3

And this is the trace from another page:

GET http://myserver.local/path/to/file.csv 400 (Bad Request)    papaparse.min.js:6
_readChunk    papaparse.min.js:6
_nextChunk    papaparse.min.js:6
stream    papaparse.min.js:6
t    papaparse.min.js:6
parsedata    timeline.js:299
Papa.parse.complete    timeline.js:292
parseChunk    papaparse.min.js:6
_chunkLoaded    papaparse.min.js:6
(anonymous function)    papaparse.min.js:6

Currently the stable version we are having to run is 4.0.7.

mholt commented 9 years ago

Looks like the server isn't liking the GET request to download the CSV file. Does it work in the browser? Are you using the exact same code you are using above? What is the server software (and version)?

mcshaman commented 9 years ago

Server: Microsoft-IIS/8.0
X-Powered-By: ASP.NET

The code I am using is something like this:

Papa.parse( 'somefile.csv', {
    delimiter: ',',
    header: true,
    download: true,
    skipEmptyLines: true,
    complete: function( res ) {
        // reference to another papaparse function
    }
} );

I am loading multiple CSVs. I am currently using the complete callback to daisy-chain the next PapaParse file request, so effectively each file is loaded sequentially.
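
The daisy-chaining looks roughly like this (the file paths are placeholders):

var files = [ '/path/to/first.csv', '/path/to/second.csv' ]; // placeholder paths
var combined = [];

function parseNext( i ) {
    if ( i >= files.length ) {
        // all files loaded; combined now holds every row
        return;
    }
    Papa.parse( files[i], {
        delimiter: ',',
        header: true,
        download: true,
        skipEmptyLines: true,
        complete: function( res ) {
            combined = combined.concat( res.data );
            parseNext( i + 1 ); // kick off the next download only after this one finishes
        }
    } );
}

parseNext( 0 );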

mholt commented 9 years ago

Yeesh. Well, that should be fine -- I'm wondering if something is up with your IIS config. How does loading the CSV by pasting the URL into the browser work? What are the response headers? What are the request headers, for that matter? (Thanks for the stack traces; now the headers would be useful too.) Do the IIS logs say anything?

mcshaman commented 9 years ago

IIS logs would be hard to get... they belong to another department.

Can I send you the headers privately? I don't want to have to clean them up for public viewing.

mholt commented 9 years ago

Yes, that'll be fine - Matthew dot Holt a.t Gmail will do.

mcshaman commented 9 years ago

Sent. Thanks for that.

mholt commented 9 years ago

Got it, thanks. Looks like in 4.1, a Range header is sent with the request whereas it wasn't in 4.0. This condition sets the Range header even if the input is not being chunked, so I think this is a bug, although IIS is responding rather strictly with a 400.
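
To illustrate what the server is seeing (a simplified sketch, not the actual PapaParse source; the byte range shown is arbitrary), the download request now carries a Range header even on an ordinary GET for the whole file, and this particular IIS setup rejects that with a 400:

var xhr = new XMLHttpRequest();
xhr.open( 'GET', '/path/to/file.csv' );
// 4.1 adds a byte-range to the download request even when no chunking was asked for
xhr.setRequestHeader( 'Range', 'bytes=0-1048575' ); // arbitrary range, for illustration only
xhr.onload = function() {
    console.log( xhr.status ); // this server answers 400 rather than 200/206
};
xhr.send();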

@bluej100 I've been swamped this month - do you happen to see a simple way to fix it? (If not, I can look at it more this weekend...)

bluej100 commented 9 years ago

Ah, I see. So before https://github.com/mholt/PapaParse/commit/57a7349c41502afff0328bab918d0e10f8b8fd80 , we did the read in a single chunk unless config.step || config.chunk. I think @mcshaman may be able to pass chunkSize: null to force a single read, and I feel like defaulting to streaming is reasonable, but perhaps we could restore the configCopy.chunkSize = null deleted in https://github.com/mholt/PapaParse/commit/0be4b502f003f47fd66b9bd8d75caca98f4c74e8 ?
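
If that works for you in the meantime, the config would look something like this (just a sketch, reusing the options from earlier in the thread):

Papa.parse( 'somefile.csv', {
    delimiter: ',',
    header: true,
    download: true,
    skipEmptyLines: true,
    chunkSize: null, // ask for a single unchunked read, per the suggestion above
    complete: function( res ) {
        // ...
    }
} );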

mholt commented 9 years ago

@bluej100 Sounds like a plan. I'm on it right now. Thanks for looking into it!

PashaBiryukov commented 8 years ago

Hi guys, I have downloaded the latest PapaParse version 4.1.2. 'beforeFirstChunk' works great, but I'm still having issues with 'worker: true' as part of the configuration while using the 'beforeFirstChunk' event. Wasn't that issue solved in 4.1.2?

Thanks for replying.

dabernathy89 commented 2 years ago

Hey, so I know this is 6 years later - but it appears that worker: true and beforeFirstChunk() are still incompatible; it would be great if this were mentioned in the docs.

srafay commented 1 year ago

beforeFirstChunk (with worker: true) still gives the 'could not be cloned' error on the latest release.