kevinlawler / kona

Open-source implementation of the K programming language
ISC License
1.36k stars 138 forks source link

CSV load issues #587

Closed danlm closed 4 years ago

danlm commented 4 years ago

When reading a CSV: ("CSISIFSCSI";"|")0: `test.csv blank entries in rows are dropped, leading to misaligned rows and a domain error. test.txt

danlm commented 4 years ago

In addition, using lowercase types seems to cause type errors. What is the "DTZMm Y?" type?

tavmem commented 4 years ago

First look: check what happens in k2.8

I downloaded the file

$ cat test.txt
time|key|SKU|Indicator|items|price|SKID|public|newSKU
1234567891822|I|1234567|F|100|19.99|FIZZ||
1234567891823|J|1234568||101|11|||
1234567891824|S|1234569|F|102|15|FIZZ||
1234567891825|D|1234570|F|103|20.99|||
1234567891826|S|1234571|F|104|12|FIZZ||
1234567891827|S|1234572|F|105|16|||
1234567891828|T|1234573||106|21.99|FIZZ||
1234567891829|T|1234574||107|13|||
1234567891830|D|1234575|F|108|17|||
1234567891831|I|1234576|F|109|22.99|||42384867
1234567891832|D|1234577|F|110|14|||
1234567891833|D|1234578|F|111|18|||
1234567891834|I|1234579|G|112|23.99|||42384758
1234567891835|J|1234580||113|15|||
$

Tried reading it in k2.8

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("CSISIFSCSI";"|")0: `test.txt
length error
("CSISIFSCSI";"|")0: `test.txt
  ^
>  
tavmem commented 4 years ago

Tried reading the file in kona:

$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSISIFSCSI";"|")0: `test.txt
Segmentation fault (core dumped)
$ 
tavmem commented 4 years ago

Page 179 of the k2.0 reference manual has a sample file that I re-created:

$ cat f
1050    1.234   abcdef  ghi
234     1e50     gqw    xy 
$ 

I tried reading the file into k2.8 using the command listed on page 180 of the reference manual:

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("IFCS"; 7 9 10 3) 0: `f
length error
("IFCS"; 7 9 10 3) 0: `f
  ^
>
tavmem commented 4 years ago

Same exercise in kona:

$ cat f
1050    1.234   abcdef  ghi
234     1e50     gqw    xy 
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("IFCS"; 7 9 10 3) 0: `f
(!0;0#0.0;();0#`)

No length error ... but not the result listed on page 180 of the reference manual.

tavmem commented 4 years ago

The file test.txt only has 9 fields, but the load command listed in the issue description has 10 characters. Dropping the last character gives this result in k2.8:

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("CSISIFSCS";"|")0: `test.txt
(("time"
  "1234567891822"
  "1234567891823"
  "1234567891824"
  "1234567891825"
  "1234567891826"
  "1234567891827"
  "1234567891828"
  "1234567891829"
  "1234567891830"
  "1234567891831"
  "1234567891832"
  "1234567891833"
  "1234567891834"
  "1234567891835")
 `key `I `J `S `D `S `S `T `T `D `I `D `D `I `J
 0N 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 1234578 1234579 1234580
 `Indicator `F ` `F `F `F `F ` ` `F `F `F `F `G `
 0N 100 101 102 103 104 105 106 107 108 109 110 111 112 113
 0n 19.99 11 15 20.99 12 16 21.99 13 17 22.99 14 18 23.99 15
 `SKID `FIZZ ` `FIZZ ` `FIZZ ` `FIZZ ` ` ` ` ` ` `
 ("public"
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  ""
  "")
 `newSKU ` ` ` ` ` ` ` ` ` `"42384867" ` ` `"42384758" `)
tavmem commented 4 years ago

Trying that exercise in kona (with 9 characters):

$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSISIFSCS";"|")0: `test.txt
Segmentation fault (core dumped)
$ 
tavmem commented 4 years ago

Since the field "SKU" is integer, the field "newSKU" should probably be integer also. In k2.8:

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("CSISIFSSI";"|")0: `test.txt
(("time"
  "1234567891822"
  "1234567891823"
  "1234567891824"
  "1234567891825"
  "1234567891826"
  "1234567891827"
  "1234567891828"
  "1234567891829"
  "1234567891830"
  "1234567891831"
  "1234567891832"
  "1234567891833"
  "1234567891834"
  "1234567891835")
 `key `I `J `S `D `S `S `T `T `D `I `D `D `I `J
 0N 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 1234578 1234579 1234580
 `Indicator `F ` `F `F `F `F ` ` `F `F `F `F `G `
 0N 100 101 102 103 104 105 106 107 108 109 110 111 112 113
 0n 19.99 11 15 20.99 12 16 21.99 13 17 22.99 14 18 23.99 15
 `SKID `FIZZ ` `FIZZ ` `FIZZ ` `FIZZ ` ` ` ` ` ` `
 `public ` ` ` ` ` ` ` ` ` ` ` ` ` `
 0N 0N 0N 0N 0N 0N 0N 0N 0N 0N 42384867 0N 0N 42384758 0N)

kona should give similar results.

tavmem commented 4 years ago

A very simple file (2 fields, 3 rows) and no blank entries, works in kona:

$ cat test2.txt
time|key
22|I
23|J
24|S
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CS";"|")0: `test2.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S)
tavmem commented 4 years ago

A first borderline case: As tested before, this file (renamed as t02.txt) works in kona;

$ cat t02.txt
time|key
22|I
23|J
24|S
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CS";"|")0: `t02.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S)

Adding an integer field, results in a "domain error"

$ cat t03.txt
time|key|SKU
22|I|67
23|J|68
24|S|69
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSI";"|")0: `t03.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S
 0N 67 68 69)
domain error
("CSI";"|")0: `t03.txt
           ^
>

t03.txt works in k2.8 (with no reported errors):

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("CSI";"|")0: `t03.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S
 0N 67 68 69)
tavmem commented 4 years ago

The next "border case" This now works (with no domain error);

$ cat t03.txt
time|key|SKU
22|I|67
23|J|68
24|S|69
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSI";"|")0: `t03.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S
 0N 67 68 69)

This fails:

$ cat t04.txt
time|key|SKU|Indicator
22|I|67|F
23|J|68|D
24|S|69|F
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSIS";"|")0: `t04.txt
Segmentation fault (core dumped)
$ 

But this works:

$ cat t05.txt
time|key|SKU|Indicator
22|I|67|92
23|J|68|93
24|S|69|94
$ 
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSII";"|")0: `t05.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S
 0N 67 68 69
 0N 92 93 94)
tavmem commented 4 years ago

After the lastest commit titled "fix segfault on t04.txt for issue 587" this works:

$ cat t04.txt
time|key|SKU|Indicator
22|I|67|F
23|J|68|D
24|S|69|F
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSIS";"|")0: `t04.txt
(("time"
  "22"
  "23"
  "24")
 `key `I `J `S
 0N 67 68 69
 `Indicator `F `D `F)

Trying the original "test.txt" results in

$ cat test.txt
time|key|SKU|Indicator|items|price|SKID|public|newSKU
1234567891822|I|1234567|F|100|19.99|FIZZ||
1234567891823|J|1234568||101|11|||
1234567891824|S|1234569|F|102|15|FIZZ||
1234567891825|D|1234570|F|103|20.99|||
1234567891826|S|1234571|F|104|12|FIZZ||
1234567891827|S|1234572|F|105|16|||
1234567891828|T|1234573||106|21.99|FIZZ||
1234567891829|T|1234574||107|13|||
1234567891830|D|1234575|F|108|17|||
1234567891831|I|1234576|F|109|22.99|||42384867
1234567891832|D|1234577|F|110|14|||
1234567891833|D|1234578|F|111|18|||
1234567891834|I|1234579|G|112|23.99|||42384758
1234567891835|J|1234580||113|15|||
$ 
$ rlwrap -n ./k
kona      \ for help. \\ to exit.

  ("CSISIFSCS";"|")0: `test.txt
(("time"
  "1234567891822"
  "1234567891823"
  "1234567891824"
  "1234567891825"
  "1234567891826"
  "1234567891827"
  "1234567891828"
  "1234567891829"
  "1234567891830"
  "1234567891831"
  "1234567891832"
  "1234567891833"
  "1234567891834"
  "1234567891835")
 `key `I `J `S `D `S `S `T `T `D `I `D `D `I `J
 0N 1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 1234578 1234579 1234580
 `Indicator `F `"101" `F `F `F `F `"106" `"107" `F `F `F `F `G `"113"
 0N 100 11 102 103 104 105 0N 13 108 109 110 111 112 15
 0n 19.99 0 15 20.99 12 16 0n 0 17 22.99 14 18 23.99 0.0
 `SKID `FIZZ (nil)  `FIZZ (nil)  `FIZZ (nil)  (nil)  (nil)  (nil)  `"42384867" (nil)  (nil)  `"42384758" (nil)  
 ("public"

)
 `newSKU (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  (nil)  )
domain error
("CSISIFSCS";"|")0: `test.txt
                 ^
>

We have finally gotten to the original reported problem ... which was "blank entries in rows are dropped, leading to misaligned rows and a domain error."

tavmem commented 4 years ago

The remaining domain error problem is resolved in commit titiled: "fix additional domain error in issue 587" Proper handling of the blank entries is left to fix.

tavmem commented 4 years ago

A summary of what I found (so far):

1. Column Headings In k2.8 and in kona, when reading a file with column headings (like test.txt), the delimiter should have a space after it (to differentiate the command from a command to read of a file with no column headings).

Consider this file:

$ cat t06.txt
SKU|Indicator|items|price|SKID|public|newSKU
1234567|F|100|19.99|FIZZ||
1234568||101|11|||
1234576|F|109|22.99|||42384867

in k2.8

  ("ISIFSCI";"|")0: `t06.txt
(0N 1234567 1234568 1234576
 `Indicator `F ` `F
 0N 100 101 109
 0n 19.99 11 22.99
 `SKID `FIZZ ` `
 ("public"
  ""
  ""
  "")
 0N 0N 0N 42384867)

  ("ISIFSCI";"| ")0: `t06.txt
(`SKU `Indicator `items `price `SKID `public `newSKU
 (1234567 1234568 1234576
  `F ` `F
  100 101 109
  19.99 11 22.99
  `FIZZ ` `
  (""
   ""
   "")
  0N 0N 42384867))

2. Using lower case types causes type errors in kona:

  ("isifsci";"| ")0: `t06.txt
type error
("isifsci";"| ")0: `t06.txt
                ^
> 

I'm not so sure that the kona result is bad. Using lower case types in k2.8 produces a result that seems to have no relation at all to the input. That appears to be a bug in k2.8:

  ("isifsci";"|")0: `t06.txt
(2085964627 875770417 875770417 875770417
 28233 31814 12668 31814
 1835365481 2083532849 2083598385 2084122673
 4.306337e+21 0.0001661524 5.237826e+36 0.0001661263
 19283 18758 31868 31868
 2088533104 -140152396 -140152428 -140152364
 1400333678 858927370 858927370 942879284)

3. Kona drops blank entries in rows, leading to misaligned rows Kona uses the the C library function strtok to parse the input rows. That causes the problem. Consider this simple C program that uses strtok:

$ cat st.c
#include <stdio.h>
#include <string.h>
int main()
{   char m[100] = "1234568||101|11|||";
    char y[2]; y[0]='|';
    char* tok;
    tok = strtok(m, y);
    while (tok != 0) {
        printf(" %s\n", tok);
        tok = strtok(0, y); }
    return (0); }

Then

$ gcc -o st st.c
$ rlwrap -n ./st
 1234568
 101
 11
$

There are 7 elements in the input, but only 3 in the output. We can put a space at each blank entry

$ cat st2.c
#include <stdio.h>
#include <string.h>
int main()
{   char m[100] = "1234568| |101|11| | | ";
    char y[2]; y[0]='|';
    char* tok;
    tok = strtok(m, y);
    while (tok != 0) {
        printf(" %s\n", tok);
        tok = strtok(0, y); }
    return (0); }

and that seems to do the trick

 gcc -o st2 st2.c
$ rlwrap -n ./st2
 1234568

 101
 11

$ 

However, spaces are not needed in k2.8 They should not be necessary in Kona, and ... if we put the spaces in our latest test file:

$ cat t07.txt
SKU|Indicator|items|price|SKID|public|newSKU
1234567|F|100|19.99|FIZZ| | 
1234568| |101|11| | | 
1234576|F|109|22.99| | |42384867

then kona still does not give exactly the results we are looking for. The spaces are retained for types S and C. They are converted to zero for type I.

  ("ISIFSCI";"| ")0: `t07.txt
(`SKU `Indicator `items `price `SKID `public `newSKU
 (1234567 1234568 1234576
  `F `" " `F
  100 101 109
  19.99 11 22.99
  `FIZZ `" " `" "
  (," "
   ," "
   ," ")
  0 0 42384867))

It appears that using strtok may not be viable in getting the results we are looking for.

tavmem commented 4 years ago

One possible strategy is to get a copy of strtok from the GNU C library, and modify it to meet our objectives.

bakul commented 4 years ago

Page 179 of the k2.0 reference manual has a sample file that I re-created:

$ cat f
1050    1.234   abcdef  ghi
234     1e50     gqw    xy 
$ 

I tried reading the file into k2.8 using the command listed on page 180 of the reference manual:

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("IFCS"; 7 9 10 3) 0: `f
length error
("IFCS"; 7 9 10 3) 0: `f
  ^
>

I am assuming the second line in f ends with y. Since each field is specified by the number of characters, >xy< is of length 2 not 3 so K is correct. If you add a space at the end as in >xy< it will do the right thing.

bakul commented 4 years ago

("CSISIFSCSI";"|")0: `test.txt length error

This is incorrect. The second arg needs to be a vector so replace "|" with ,"|".

("CSISIFSCSI";,"|")0: `test.txt

But this still won't work since this file has a header in the first line so the "I" and "F" domains will fail. [Edit: looks like the first line is treated specially by k so the prev. sentence is not true] In other words, there may or may not be a bug but the test needs to be fixed first!

tavmem commented 4 years ago

Regarding the example in the k2.0 reference manual: There was already a space at the end of xy Consider this file (with an additional character):

$ cat g
1050    1.234   abcdef  ghi
234     1e50     gqw    xyz
$ 
$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("IFCS"; 7 9 10 3) 0: `g
length error
("IFCS"; 7 9 10 3) 0: `g
  ^
> 

It still fails.

bakul commented 4 years ago

This worked for me with k-3.2 (under FreeBSD). It should also work for k-2.8. Make sure there exactly 29 characters per line (not counting the newline at the end).

tavmem commented 4 years ago

Regarding your comments about fixing the test: I agree that that 2nd arg needs to be a vector (since there are column headings). The only other thing that needs fixing is that instead of "CSISIFSCSI" drop the final occurence of "S" Then it works fine in k2.8

$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("CSISIFSCI";"| ")0: `test.txt
(`time `key `SKU `Indicator `items `price `SKID `public `newSKU
 (("1234567891822"
   "1234567891823"
   "1234567891824"
   "1234567891825"
   "1234567891826"
   "1234567891827"
   "1234567891828"
   "1234567891829"
   "1234567891830"
   "1234567891831"
   "1234567891832"
   "1234567891833"
   "1234567891834"
   "1234567891835")
  `I `J `S `D `S `S `T `T `D `I `D `D `I `J
  1234567 1234568 1234569 1234570 1234571 1234572 1234573 1234574 1234575 1234576 1234577 1234578 1234579 1234580
  `F ` `F `F `F `F ` ` `F `F `F `F `G `
  100 101 102 103 104 105 106 107 108 109 110 111 112 113
  19.99 11 15 20.99 12 16 21.99 13 17 22.99 14 18 23.99 15
  `FIZZ ` `FIZZ ` `FIZZ ` `FIZZ ` ` ` ` ` ` `
  (""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   ""
   "")
  0N 0N 0N 0N 0N 0N 0N 0N 0N 42384867 0N 0N 42384758 0N))

It fails in kona because of the blank entries (which cause misalignment in kona).

tavmem commented 4 years ago

Thanks !! The example in the k2.0 manual works in k2.8 and in kona if there are 2 spaces at the beginning of each line, AND each line is exactly 29 characters long.

$ cat h
  1050    1.234   abcdef  ghi
  234     1e50     gqw    xy 
$
$ rlwrap -n ./k
K 2.8 2000-10-10 Copyright (C) 1993-2000 Kx Systems
Evaluation. Not for commercial use. 
\ for help. \\ to exit.

  ("IFCS"; 7 9 10 3) 0: `h
(1050 234
 1.234 1e+50
 ("  abcdef  "
  "   gqw    ")
 `ghi `xy)
bakul commented 4 years ago

Looks like the standard strtok eats null fields. So "123|||||45|" will return "123" and on second call (with first arg NULL), it will return"45" and then a NULL. I think it is better to not use it. A better solution may be something like

int count_fields(char*str, char*sep);
int split_fields(char* str, char* sep, char** fields, int n);

The first merely returns the count of instances of sep + 1. This can be used to allocate an array of N char*, using which you can make the second call, some of which may be just "". For kona use you only do this allocation once and count-fields gives you a simple test to see if a line has the right number of fields! If you want, I can write them (without touching kona source code). Later this week.

tavmem commented 4 years ago

Your strategy seems quite viable. It would be great if you had the time to write the functions.

As mentioned above, the strategy I was pursuing was to get a copy of the code for strtok (which I've done), study the code, and then modify the code to create a version that did not eat the null fields. I'm not sure how long it would take me to do that.

bakul commented 4 years ago

Note that Gnu strtok is under LGPLv2.1 (or v3). IANAL but I believe it is more restrictive than the license kona uses.

bakul commented 4 years ago

I don't need to write anything as strsep() is exactly what you need instead of strtok()! It is not part of C or POSIX standard but it handles null fields & Linux + *BSDs (includes MacOS) provide it. If Windows doesn't, you can use the one from FreeBSD. See the example use below:

#include <stdio.h>
#include <string.h>

int main() {
        char m[] = "123|456|..|78|";
        for (char *last = m;;) {
                char *tok = strsep(&last, ".|");
                if (tok == 0) break;
                printf("<%s>", tok);
        }
        printf("\n");
        return 0;
}

This will produce <123><456><><><><78><>

tavmem commented 4 years ago

Thanks !!!!!