javascriptdata / danfojs

Danfo.js is an open-source JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.
https://danfo.jsdata.org/
MIT License
4.8k stars, 209 forks

Double CSV Load #107

Closed: GantMan closed this issue 3 years ago

GantMan commented 3 years ago

Right now read_csv loads the CSV file twice.

Code example (using your own pen, pointed at the latest version):

https://codepen.io/risingodegua/pen/bGwPGMG

The issue

image

This has been happening for a few versions. When the CSV is hundreds of MB, this basically doubles the load time.

GantMan commented 3 years ago

I don't see why it's loading the file twice. Is it calling the function twice?

https://github.com/opensource9ja/danfojs/blob/358577496131b3fac8d22db2e9fc664a00bc2d83/danfojs/src/io/reader.js#L21-L40

risenW commented 3 years ago

@GantMan This is really strange to me as well. The code basically uses the tfjs data module (tf.data) to load the file.

I'll investigate with tf.data.csv separately first.

risenW commented 3 years ago

@GantMan Just found out this is coming from the tensorflow csv function. Tried this:

    let data = [];
    const csvDataset = tf.data.csv("https://s3.amazonaws.com/ir_public/temp/chess_labels.csv");
    const column_names = await csvDataset.columnNames();
    const sample = csvDataset.take(10);
    await sample.forEachAsync((row) => data.push(Object.values(row)));
    console.log(data);

and in the Network tab, I got this:

Screen Shot 2021-02-14 at 6 59 36 PM

A possible solution is to load and parse the CSV using another library, such as Papa Parse.
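As a rough illustration of the single-download idea, the sketch below fetches the file once and parses the response text in memory. It is not danfo.js code: `loadCsvOnce` and `parseCsvText` are hypothetical names, and the naive comma splitter only stands in for a real parser such as Papa Parse (it ignores quoted fields and escapes).

```javascript
// Naive CSV splitter: an illustrative stand-in for a real parser such as
// Papa Parse. It does NOT handle quoted fields, embedded commas, or escapes.
function parseCsvText(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) return { columns: [], rows: [] };
  const columns = lines[0].split(",");
  const rows = lines.slice(1).map((line) => line.split(","));
  return { columns, rows };
}

// Hypothetical helper (not part of danfo.js): downloads the file with a
// single request and parses the response body in memory.
// `fetchFn` defaults to the global fetch and can be swapped out for testing.
async function loadCsvOnce(url, fetchFn = globalThis.fetch) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Failed to load ${url}: HTTP ${response.status}`);
  }
  return parseCsvText(await response.text());
}
```

Because the body is read exactly once, any second request would have to come from somewhere else in the pipeline. All values come back as strings here; type inference would still need to be added on top.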

GantMan commented 3 years ago

I'll go file a ticket with TFJS on this and see if I can fix it.

GantMan commented 3 years ago

Hey bud, this issue is definitely with Danfo and not TFJS.

image

See proof of concept here: https://codepen.io/gantman/pen/abBRObO

risenW commented 3 years ago

@GantMan Adding this line to the tensorflow csv function causes it to load twice:

    await sample.forEachAsync((row) => data.push(Object.values(row)));

This code:

    const data = [];
    const csvDataset = tf.data.csv(
      "https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv"
    );
    const column_names = await csvDataset.columnNames();
    const sample = csvDataset.take(10);
    await sample.forEachAsync((row) => data.push(Object.values(row)));
    console.log(data);

    dfd.read_csv(
      "https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv"
    );

currently gives me:

Screen Shot 2021-03-08 at 4 22 00 PM

We currently call this function inside danfo's read_csv function. I'm still investigating how to solve this.

risenW commented 3 years ago

This is a very weird bug. It happens only in the browser version of tfjs, and I also noticed that the second load comes from the cache.
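One way to confirm the duplicate load without reading the Network tab by hand is to count requests per URL. The sketch below is a hypothetical debugging helper (`countRequests` is not part of tfjs or danfo.js) and assumes the loader issues its requests through the fetch function being wrapped. It only counts calls, so it cannot tell whether a request was served from the cache.

```javascript
// Hypothetical debugging helper: wraps a fetch-like function and records
// how many times each URL is requested.
function countRequests(fetchFn) {
  const counts = new Map();
  const wrapped = (input, init) => {
    // `input` may be a URL string, a URL object, or a Request-like object
    // that exposes a `url` field.
    const url = typeof input === "string" ? input : input.url || String(input);
    counts.set(url, (counts.get(url) || 0) + 1);
    return fetchFn(input, init);
  };
  return { fetch: wrapped, counts };
}

// Possible browser usage (assumption: requests go through window.fetch):
//   const tracker = countRequests(window.fetch.bind(window));
//   window.fetch = tracker.fetch;
//   ...call read_csv, then inspect tracker.counts
```

If the same CSV URL shows a count of 2 after a single read_csv call, the second request is coming from the code path under test rather than from the page itself.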