pola-rs / nodejs-polars

nodejs front-end of polars
https://pola-rs.github.io/nodejs-polars/
MIT License

pl.readCSV fail with tab separator #155

Closed: gplanansky closed this 8 months ago

gplanansky commented 10 months ago

What version of polars are you using?

version: "0.8.4"

What operating system are you using polars on?

mac os 13.6

What node/deno version are you using?

deno 1.39.2

Describe your bug.

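In the session below, pl.readCSV(data_tsv, { sep: "\t" }) on the contents of the TSV file fails to separate the data items (the whole header line comes back as a single column name), whereas pl.readCSV(data_csv, { sep: "," }) on the contents of the CSV file succeeds.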

(This is from an example using polars in deno that evidently worked 4 months ago: https://github.com/rgbkrk/denotebooks/blob/main/10.2_Polar%20DataFrames.ipynb)

Running deno in a directory with the data_tsv.txt, data_csv.txt files:

data_csv.txt data_tsv.txt

$ cat data_tsv.txt
col1    col2    col3
r1c1    r1c2    r1c3
r2c1    r2c2    r2c3

$ cat data_csv.txt
col1,col2,col3
r1c1,r1c2,r1c3
r2c1,r2c2,r2c3

$ od -c data_tsv.txt
0000000    c   o   l   1  \t   c   o   l   2  \t   c   o   l   3  \n   r
0000020    1   c   1  \t   r   1   c   2  \t   r   1   c   3  \n   r   2
0000040    c   1  \t   r   2   c   2  \t   r   2   c   3  \n            
0000055
$ od -c data_csv.txt
0000000    c   o   l   1   ,   c   o   l   2   ,   c   o   l   3  \n   r
0000020    1   c   1   ,   r   1   c   2   ,   r   1   c   3  \n   r   2
0000040    c   1   ,   r   2   c   2   ,   r   2   c   3  \n            
0000055
$ deno
Deno 1.39.3
exit using ctrl+d, ctrl+c, or close()
REPL is running with all permissions allowed.
To specify permissions, run `deno repl` with allow flags.
> import pl from "npm:nodejs-polars";
undefined
>  let data_tsv = await Deno.readTextFile('data_tsv.txt');
undefined
> data_tsv
"col1\tcol2\tcol3\nr1c1\tr1c2\tr1c3\nr2c1\tr2c2\tr2c3\n"
> let df_tsv = pl.readCSV(data_tsv, { sep: "\t" });
undefined
> df_tsv.columns
[ "col1\tcol2\tcol3" ]
>  let data_csv = await Deno.readTextFile('data_csv.txt');
undefined
> data_csv
"col1,col2,col3\nr1c1,r1c2,r1c3\nr2c1,r2c2,r2c3\n"
> let df_csv = pl.readCSV(data_csv, { sep: "," });
undefined
> df_csv.columns
[ "col1", "col2", "col3" ]
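
For anyone who wants to check the result programmatically rather than by eye, here is a minimal sketch. The helpers `headerFields` and `sameColumns` are hypothetical names made up for this illustration and are not part of nodejs-polars. The split is deliberately naive (no quoted-field handling), which is enough for the sample data above.

```typescript
// Hypothetical helpers for illustration only; not part of nodejs-polars.

// Split the first line of delimited text on the given separator.
// Naive: does not handle quoted fields that contain the separator.
function headerFields(text: string, sep: string): string[] {
  const firstLine = text.split(/\r?\n/, 1)[0] ?? "";
  return firstLine.split(sep);
}

// True when both column lists have the same names in the same order.
function sameColumns(actual: string[], expected: string[]): boolean {
  return actual.length === expected.length &&
    actual.every((name, i) => name === expected[i]);
}

// The TSV sample from this report, as returned by Deno.readTextFile.
const tsvText = "col1\tcol2\tcol3\nr1c1\tr1c2\tr1c3\nr2c1\tr2c2\tr2c3\n";
const expected = headerFields(tsvText, "\t");

// Column list reported above for the TSV data: one unsplit name.
console.log(sameColumns(["col1\tcol2\tcol3"], expected)); // false

// Column list a correct parse should produce.
console.log(sameColumns(["col1", "col2", "col3"], expected)); // true
```

In a real session the first argument would be `df_tsv.columns` instead of a literal array.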

Bidek56 commented 10 months ago

Please use pl.scanCSV until we have a PR for this issue. Thanks for understanding and for bringing this issue up.

const df = await pl.scanCSV("data_tsv.txt", { sep: "\t" }).collect();

Bidek56 commented 10 months ago

The code only treats the argument as a file path if it has one of these extensions: [".tsv", ".csv"]; otherwise it thinks it's inline text. https://github.com/pola-rs/nodejs-polars/pull/156 has a fix for this issue.
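
To make the extension check concrete, here is a rough sketch of that kind of path-versus-inline decision. This is not the library's actual source: `classifyCsvInput`, `CsvInput`, and `PATH_EXTENSIONS` are hypothetical names, and the real code may differ in details such as case handling.

```typescript
// Hypothetical illustration of an extension-based check;
// NOT the actual nodejs-polars implementation.
const PATH_EXTENSIONS = [".tsv", ".csv"];

type CsvInput =
  | { kind: "path"; path: string }
  | { kind: "inline"; body: string };

// Treat the string as a file path only if it ends with a known
// extension; anything else is assumed to be inline CSV text.
function classifyCsvInput(pathOrBody: string): CsvInput {
  const looksLikePath = PATH_EXTENSIONS.some((ext) => pathOrBody.endsWith(ext));
  return looksLikePath
    ? { kind: "path", path: pathOrBody }
    : { kind: "inline", body: pathOrBody };
}

console.log(classifyCsvInput("data.tsv").kind);     // "path"
console.log(classifyCsvInput("data_tsv.txt").kind); // "inline"
console.log(classifyCsvInput("col1\tcol2\n").kind); // "inline"
```

Under a check like this, a file name such as "data_tsv.txt" is handled as inline text, not opened as a file.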

Using version 0.8.3 should work as well.

import pl from "npm:nodejs-polars@0.8.3";

gplanansky commented 10 months ago

Thanks. And roger the allowed extensions -- I only used ".txt" here because the "paste, click to add files" upload does not support the .tsv extension.

Bidek56 commented 8 months ago

Can you please check: "nodejs-polars": "0.9.0" ? Thx

gplanansky commented 8 months ago

@Bidek56
It works. Using nodejs-polars 0.9.0, data files with .txt extensions yield the same correct results as data files with .csv and .tsv extensions. Yay!

Tested using the same example files:

$ ll
-rw-r--r--  1 george  staff  66621 Mar  9 02:49 data.csv
-rw-r--r--  1 george  staff  66621 Mar  9 02:49 data.tsv
-rw-r--r--  1 george  staff  66621 Mar  9 02:41 data_csv.txt
-rw-r--r--  1 george  staff  66621 Mar  9 02:41 data_tsv.txt
$ cat data_csv.txt | head -1
scalerank,featurecla,labelrank,sovereignt,sov_a3,adm0_dif,level,type,admin, ...
$ cat data_tsv.txt | head -1
scalerank   featurecla  labelrank   sovereignt  sov_a3  adm0_dif ...
$ diff data.csv data_csv.txt 
$ diff data.tsv data_tsv.txt 

$ deno
Deno 1.41.2
exit using ctrl+d, ctrl+c, or close()
REPL is running with all permissions allowed.
To specify permissions, run `deno repl` with allow flags.
> import pl from "npm:nodejs-polars";
undefined
> pl.pl.version
"0.9.0"

> let csv = await Deno.readTextFile('data.csv')
> let dfcsv = pl.readCSV(csv, { sep: "," });
> dfcsv.columns
[
  "scalerank",  "featurecla", "labelrank",  "sovereignt", ...

> let tsv = await Deno.readTextFile('data.tsv')
> let dftsv = pl.readCSV(tsv, { sep: "\t" });
> dftsv.columns
[
  "scalerank",  "featurecla", "labelrank",  "sovereignt", ...

> let data_csv = await Deno.readTextFile('data_csv.txt');
> let df_csv = pl.readCSV(data_csv, { sep: "," });
> df_csv.columns
[
  "scalerank",  "featurecla", "labelrank",  "sovereignt", ...

> let data_tsv = await Deno.readTextFile('data_tsv.txt');
> let df_tsv = pl.readCSV(data_tsv, { sep: "\t" });
> df_tsv.columns
[
  "scalerank",  "featurecla", "labelrank",  "sovereignt", ...