mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

Local file parsing adds an extra row #447

Open davidroeca opened 6 years ago

davidroeca commented 6 years ago

Thanks for such a great library! Just pointing out a minor issue I've found both in the application I'm working on and in the demo: when parsing a local file, an additional (blank) row is added.

Try the following file: broken.txt (a file copy of the string example) at http://papaparse.com/demo to see what I mean. With header: true you get the explicit error, which says the final line is missing all columns after column 1, and that column is recorded as the empty string.

Not a huge issue, since I can add skipEmptyLines: true as a workaround, but the file itself doesn't contain this final line.

pokoli commented 6 years ago

I think the problem is related to how input files are handled. Probably an empty line is added somewhere along the way.

I've parsed the same file in Node, which produces the correct behaviour.

You can try it with the following command:

const fs = require('fs');
const Papa = require('./papaparse.js');

Papa.parse(fs.createReadStream('/tmp/mozilla_sergi0/broken.txt'), {
  delimiter: ',',
  header: true,
  complete: function (results) { console.log(results); }
});

Run it in a Node console from the same directory where the papaparse.js file lives.

The demo code is available on the gh-pages branch of this repository. Maybe that's what has to be fixed.

davidroeca commented 6 years ago

@pokoli the same issue exists in my app, with version 4.3.6, so I'm pretty sure this isn't just an issue with the demo on the site.

EDIT: This works in Node with 4.3.6 as well, just as the text input on the demo site works flawlessly. I'm trying to find a good way to emulate a browser file-like object in Node.js, but I believe the problem boils down to the browser's File object implementation.

davidroeca commented 6 years ago

I've tested this on Firefox and Chrome, and it seems that the FileReader API's readAsText method always adds a line break to the end of a file that doesn't originally have one.

There are very few hints in the standard about this behaviour, other than the notion of converting line endings to native, which may explain how each line is handled in the implementation (strangely, blobs default to transparent line endings, not native).

I'm not sure if there's a graceful way to handle this case, but if there's some way to trim the last trailing newline, that could be one approach.
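One way to sketch that approach (a hypothetical helper, not part of PapaParse): strip a single trailing line break from the text before handing it to the parser.

```javascript
// Hypothetical helper: remove one trailing line break (CRLF, CR, or LF)
// from the text produced by FileReader.readAsText before parsing it.
function trimTrailingNewline(text) {
  return text.replace(/\r\n$|\r$|\n$/, '');
}

console.log(JSON.stringify(trimTrailingNewline('a,b\nc,d\n'))); // "a,b\nc,d"
```

Only the final line break is removed, so intentional blank lines elsewhere in the file are untouched.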

pokoli commented 6 years ago

Yes, I had a feeling there would be some weird browser behaviour involved. Thanks for confirming.

I don't like the idea of handling this behaviour in the library, as it will probably end up as untested code that may become obsolete when browsers change their behaviour.

I think the easiest solution is skipEmptyLines. We should probably add a note to the docs (they are in the gh-pages branch of this repository) explaining this browser behaviour and recommending the skipEmptyLines flag when reading files via the FileReader API.

What do you think?
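As a rough illustration of what that recommendation amounts to (a conceptual sketch, not PapaParse's actual implementation): skipEmptyLines drops rows whose only cell is the empty string, which is exactly the shape of the phantom trailing row.

```javascript
// Parsed output with the extra blank row the browser's FileReader introduces
const rows = [['a', 'b'], ['1', '2'], ['']];

// Conceptually, skipEmptyLines filters out rows containing a single empty cell
const filtered = rows.filter((row) => !(row.length === 1 && row[0] === ''));

console.log(filtered); // [ [ 'a', 'b' ], [ '1', '2' ] ]
```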

davidroeca commented 6 years ago

I agree, especially since this behaviour isn't even part of a standard, so it may change. It probably makes sense to document it, and possibly highlight it in the demo portion as well.

I added an additional config in a fork which only skips empty lines at the end of the file, though skipEmptyLines might be enough here.

https://github.com/mholt/PapaParse/pull/446 should be merged first in either case

pokoli commented 6 years ago

I'm waiting for the new parameter in order to merge #446.

I don't think we should have a flag to skip the last row in the core library.

shamess commented 6 years ago

I'm working on the new parameter in #446 right now. Should be open for PR soon.

joel-zz commented 5 years ago

The skipEmptyLines option does not work on a Macintosh .csv. The carriage return comes through as:

11: ["↵"]

@shamess

webstoreportal commented 3 years ago

Excel saves an extra line at the end of the file (this new line is in the source file itself; I haven't seen that new-line behaviour replicated when using FileReader.readAsText, though the API could have changed that behaviour already).

Tested by manually removing it in Notepad, then opening in Excel and re-saving as .csv.

Papa Parse appears to start parsing the newline as the first field of a new row. This resolves to a null row and produces an error referring to that "row" (the blank new line):

code: "TooFewFields"
message: "Too few fields: expected n fields but parsed 1"
row: m
type: "FieldMismatch"

Checked by using Notepad++ with Show Symbol > Show All Characters turned on, which shows the CR LF markers.

For Mac, does your file use the same line-ending standard consistently? (If so, the config option would help here.)

mindrunner commented 2 years ago

I just stumbled over this. I create RFC 4180 CSV files with the help of csv-writer. In my end-to-end tests, I use papaparse to parse and evaluate the result. According to RFC 4180, each line (record) must be terminated by a line break, including the last one. My interpretation is that every CSV file must end with a trailing newline (which is pretty much best practice in most text-based formats, source files, etc. as well).

Thus, papaparse should not add an empty record at the end that the developer needs to suppress by manually enabling skipEmptyLines.
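The tension can be illustrated with a plain split (a sketch of the ambiguity, not PapaParse's actual parsing logic): when the last record is CRLF-terminated, a naive parser that treats the final CRLF as a separator sees a phantom empty record.

```javascript
// A CSV whose last record is terminated by CRLF, as RFC 4180-style writers emit
const csv = 'a,b\r\n1,2\r\n';

// Splitting on the record terminator yields a trailing empty string,
// which a naive parser then reports as an extra blank row
const records = csv.split('\r\n');
console.log(records); // [ 'a,b', '1,2', '' ]
```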

Aminat00 commented 1 year ago

skipEmptyLines: true

Hello, could you specify where exactly I can add this code?

davidroeca commented 1 year ago

@Aminat00 see the API here, in particular the config argument:

const data = Papa.parse(dataRaw, { skipEmptyLines: true })

Aminat00 commented 1 year ago

> @Aminat00 see the API here, in particular the config argument:
>
> const data = Papa.parse(dataRaw, { skipEmptyLines: true })

Thank you very much