mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.57k stars 1.15k forks

chunkSize do not have any effect, defaults to around 65KB #616

Open apmcodes opened 5 years ago

apmcodes commented 5 years ago

Trying to set chunkSize to 50KB, but no matter what I set, it seems to read roughly 65KB per chunk. I have tried all three settings individually, but none has any effect on the chunk size (the number of lines read from the CSV on each chunk callback stays the same):

options.chunkSize = 40000

Papa.RemoteChunkSize = 40000;

Papa.LocalChunkSize = 40000;

Even after setting options.chunkSize = null, Papa still parses in multiple chunks.

Please help ...

apmcodes commented 5 years ago

duh ...

Serrulien commented 5 years ago

Hi, can you show your configuration?

minified papaparse version: 4.6.0

After some testing I've found that setting worker: true reads the local file with the default chunk size (which is 10 MiB), even when Papa.LocalChunkSize is set to another value.

To check, I used the "large file" (~49MiB) provided in the demo and the following configuration (from the documentation):

Papa.parse(file, {
    delimiter: "",  // auto-detect
    newline: "",    // auto-detect
    quoteChar: '"',
    escapeChar: '"',
    header: false,
    transformHeader: undefined,
    dynamicTyping: false,
    preview: 0,
    encoding: "",
    worker: false,
    comments: false,
    step: undefined,
    complete: parseComplete,
    error: undefined,
    download: false,
    skipEmptyLines: false,
    chunk: chunkComplete,
    fastMode: undefined,
    beforeFirstChunk: undefined,
    withCredentials: undefined,
    transform: undefined,
    delimitersToGuess: [',', '\t', '|', ';', Papa.RECORD_SEP, Papa.UNIT_SEP]
});

var nbChunks = 0;

function parseComplete(results, file)
{
    console.info("parseComplete");
    console.log(nbChunks);
    nbChunks = 0;
}

function chunkComplete(results, parser)
{
    nbChunks++;
}

Let's play with that while varying worker and Papa.LocalChunkSize:

worker | Papa.LocalChunkSize | nbChunks
------ | ------------------- | --------
false  | default (10*2**20)  | 5 ✔️
true   | default (10*2**20)  | 5 ✔️
false  | 2**20               | 48 ✔️
true   | 2**20               | 5 ❌

As a workaround, I set worker: false and a function in chunk. It seems to work so far. @apmcodes, hope that helps.

Serrulien commented 5 years ago

Forgot to say that when you set worker to false, it won't launch any workers, of course.

Serrulien commented 5 years ago

I checked the old issues. Workers do use the given chunk size when it is passed via the chunkSize configuration property (which is undocumented). Avoid relying on Papa.LocalChunkSize with workers.
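Based on that finding, a minimal config sketch (reusing nbChunks and parseComplete from the snippet above) that passes the chunk size per-parse instead of through the global:

```javascript
// Sketch: with worker: true, pass chunkSize (in bytes) directly in the
// config; per the finding above, workers honour this per-parse option
// but not the Papa.LocalChunkSize global.
Papa.parse(file, {
    worker: true,
    chunkSize: 1024 * 1024, // 1 MiB per chunk
    chunk: function(results, parser) {
        nbChunks++;
    },
    complete: parseComplete
});
```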

apmcodes commented 5 years ago

@Serrulien Thank you very much for the detailed explanation. Sorry for the late reply.

Please note: I'm using PapaParse in an Express app with the multer middleware to upload the file as multipart.

Since I'm using a cloud service (S3) as the remote file location and the aws-s3 SDK's streaming API, chunkSize does not seem to have any effect (not sure if streaming is causing this issue).

The chunk size received hovers around 15KB (~300 rows with a few columns).

NOTE: Even when streaming the CSV file from the browser directly (no cloud storage) to PapaParse in the Express app, I observed the same chunkSize behaviour.

Config
            header: false, 
            skipEmptyLines: true,
            chunk: this.importDB.bind(this), 
            beforeFirstChunk: this.importModel.bind(this),
            complete: this.importFinish.bind(this, this.cb),
            error: this.importError.bind(this),
            encoding: "utf8",
            preview: 0,
            chunkSize: 40000
            // chunkSize : 1024*1024*10,    // No effect

Info fetched from PapaParse cursor object

results count 687
receivedSize 47657
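For context, in Node PapaParse can be handed a Readable stream directly, and each chunk callback then corresponds to one buffer delivered by the stream, which is why chunkSize appears capped. A stripped-down sketch of that setup (the stream source here is a stand-in, not the actual S3 SDK or multer stream):

```javascript
const Papa = require('papaparse');      // assumes papaparse is installed
const { Readable } = require('stream');

// Hypothetical stand-in for the S3 SDK's streaming body or the
// multer-provided upload stream.
const csvStream = Readable.from(['a,b\n1,2\n', '3,4\n5,6\n']);

Papa.parse(csvStream, {
    header: false,
    skipEmptyLines: true,
    chunkSize: 40000, // with a stream input, chunk boundaries follow the
                      // stream's own buffers, so this has little effect
    chunk: (results) => console.log('rows in chunk:', results.data.length),
    complete: () => console.log('done')
});
```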
akash-rajput commented 4 years ago

Any updates on this?

WarrenWilkinson commented 2 years ago

I had this issue using fs.createReadStream to read the file. It appears that there is a buffer inside the stream: Node's default highWaterMark for fs.createReadStream is 64 KiB, which matches the ~65KB chunks reported above. So it's not PapaParse's fault.

If this is your issue, you can pass options to fs.createReadStream to make it buffer more.
Something like this snippet should get you started:

Papa.LocalChunkSize =  Papa.LocalChunkSize * 10;
const file = fs.createReadStream(dataPath, { highWaterMark: Papa.LocalChunkSize });