mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.3k stars 1.14k forks source link

Skip lines future (Feature Suggestion) #738 #1021

Closed bhuvaneshwararaja closed 9 months ago

bhuvaneshwararaja commented 9 months ago

Issues No : #738

Description:

In this feature enhancement, Added the capability to skip a defined number of lines when converting a CSV (Comma-Separated Values) file to JSON format. This functionality offers users more flexibility when dealing with CSV files with header information or metadata at the beginning, allowing for a more efficient and precise conversion process.

Changes Made:

  1. Introduced a new configuration parameter, skipFirstNLines, which allows users to specify the number of lines to skip at the beginning of the CSV file during the conversion to JSON. This parameter supports values from 0 and above.
  2. Implemented validation for the skipFirstNLines configuration parameter to ensure that it does not negatively impact the parsing process, even if it is set to 0 or given a negative value.
  3. Enhanced the user interface of the player.html file by adding an input field that allows users to input the desired value for skipFirstNLines.
  4. Updated the player.js script to retrieve and utilize the skipFirstNLines input from the user and incorporate it into the parser's configuration.

Test cases:

Included test cases to validate the functionality of the skipFirstNLines feature, ensuring that it behaves correctly under various scenarios.

Test 1: "Skip First N number of lines , with header",

Test 2: "Skip First N number of lines , without header",

Test 3: "Skip First N lines with value as 0 with header false",

Test 4: "Skip First N lines with value as 0 with header true",

Test 5: "Skip First N lines with value less then 0, with header true",

Test 6: "Without Skip First N Lines config",

pokoli commented 9 months ago

What you are proposing is already implemented using the preview config option.

bhuvaneshwararaja commented 9 months ago

Hi @pokoli , I have gone through the documentation (https://www.papaparse.com/docs) for the preview options and got to know that will help to parse the expected number of rows in csv example: If we have 20 rows in csv, If the preview value is 10 . Out of 20 only 10 rows will be parsed .

But this PR will add a config to skip the N number of rows in the beginning of the csv. Example: If we have 20 rows in csv , If skipNLines value is 10 . Out of 20 , first 10 rows will be skipped .

Please check it once again , it is relevant to the issue #738 as well.

pokoli commented 9 months ago

Ok, then we need to document this new flag to explain which is the ussage.

I think the flag will be better named as skipFirstLines.

Could you have a look at the test suite? it seems you introduced some failing tests.

bhuvaneshwararaja commented 9 months ago

Hi @pokoli , Thanks Sure i will check those test suite , I will update the PR with doc

bhuvaneshwararaja commented 9 months ago

Hi @pokoli , I have updated the PR

janisdd commented 9 months ago

Just a small note: This does not work for quoted fields...

1,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
2,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
3,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
4,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C
5,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C

5 lines with 12 columns

Papa.parse('1,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n2,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n3,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n4,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C\n5,"B\nB",C,A,"B\nB",C,A,"B\nB",C,A,"B\nB",C', 
{skipFirstNLines: 5})

skips the "real" first line (5 because \n after each line).

pokoli commented 9 months ago

@janisdd It will be great if you can fill an issue/merge request to make this work with quotes also.

janisdd commented 9 months ago

Well, no... I don't think this feature should be added to the library. Here are my thoughts:

However, here are some notes if someone wants to implement it anyway:

I'm not sure if this would also work for streaming/chunking (or whatever, I just use the normal parsing). But as far as i know there is only one core logic/function for the actual parsing... which means this should also work for streaming

jkruke commented 4 months ago

I think this feature hasn't been implemented as wanted in the referenced issue #738 ! The example CSV was:

This is a data file generated by some old software.
Next line will contain a headers of parameters.
Temperature, Humidity, Voltage
22.5, 45.5, 220
23.0, 44.0, 219

Expected output should be (to be fair, it wasn't precisely specified):

Temperature, Humidity, Voltage
22.5, 45.5, 220
23.0, 44.0, 219

But the actual output with skipFirstNLines=2 is (according to the test cases):

This is a data file generated by some old software.
23.0

Analogue to the following test case in the code: https://github.com/mholt/PapaParse/pull/1021/files#diff-e0ce8cb4901057c1880bee545909a64f38c7383b4d41982d6f2db9a8ec81eac7R1588

{
        description: "Skip First N number of lines , with header and 3 rows",
        input: 'a,b,c,d\n1,2,3,4\n4,5,6,7',
        config: { header: true, skipFirstNLines: 1 },
        expected: {
            data: [{a: '4', b: '5', c: '6', d: '7'}],
            errors: []
        }
    }

This test case is not realistic because it does not reflect a CSV with some preamble lines to be skipped. should be rather: input: 'to-be-ignored\na,b,c,d\n1,2,3,4', and expected.data: [{a: '1', b: '2', c: '3', d: '4'}].

If someone wanted to skip rows of the record set, it should not be done during the parsing phase, but later when working with the data sets by doing some simple postprocessing such as data = data.slice(1).

@bhuvaneshwararaja, @pokoli could you please take a look at it and give me some feedback about my thoughts? :)