johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io
Other
8.84k stars 211 forks source link

Error in `splita`/`splitax` when field contains a single non-string value #1589

Closed valxntine closed 2 days ago

valxntine commented 2 months ago

I have a CSV that has a field EAN, this is either a single value:

7622210953278

or a concatenated string of values:

"7622210953278#7622210953279#7622281623271"

When I split the field, it works when it's a case where there's multiple values, otherwise it ends up replacing the field with (error). I'm converting this CSV to JSON, for what it's worth.

The command I'm running is:

mlr --icsv --ojson --from products.csv put '$EAN = splita($EAN, "#")' then cut -f SKU,TITLE,LONG_DESC,EAN

Which outputs something like this (albeit longer):

{
  "SKU": 0000000,
  "LONG_DESC": "some description",
  "EAN": (error),
  "TITLE": "some title",
},
{
  "SKU": 0000000,
  "LONG_DESC": "some description",
  "EAN": [5000159372343, 5000159470490],
  "TITLE": "some title",
},
{
  "SKU": 0000000,
  "LONG_DESC": "some description",
  "EAN": (error),
  "TITLE": "some title",
},
{
  "SKU": 0000000,
  "LONG_DESC": "some description",
  "EAN": (error),
  "TITLE": "some title",
},

I've only just come across miller, so I may be missing something obvious, I've tried with both splita and splitax. Any advice or is this not something I can work around?

valxntine commented 2 months ago

As soon as I finished typing this I realised a possible solution and turns it out it worked: mlr --icsv --ojson --from products.csv put '$EAN = string($EAN)' then put '$EAN = splita($EAN, "#")' then cut -f SKU,TITLE,LONG_DESC,EAN | jq

Apologies for the short-lived issue, feel free to close!

johnkerl commented 2 months ago

Yes casting to string works since

7622210953278

type-infers as int but

7622210953278#7622210953279#7622281623271

type-infers to string.

~And, splita returns error when given a string.~

[Typo found 2024-08-25]: And, splita returns error when given a string.

Maybe splita ought to do to-string for the user, even when not asked to do so ...

I've not been super-consistent with this TBH.

On the one hand I never ever wanted Miller to be as loosey-goosey with string vs int as, say, JavaScript or PHP -- I always wanted "2"+3 to be error not 5. So most of the time I prefer to return error rather than doing implicit type-casts.

On the other hand, I do some implicit type-casts -- the dot operator for example, since I felt there was nothing else meaningful for . to do so I might as well stringify all inputs. So "3" . 4 is "34". Also substr will stringify its first argument: substr1(123456789, 3, 5) gives "345".

The split functions should probably do the same ...

valxntine commented 2 months ago

The splita and splitax docs both say that they split a string into array, I think it could be fair to assume as a user that the argument would be converted to a string.

I'm happy to take a stab and raise a PR for this in the next few days, if you'd like?

johnkerl commented 2 months ago

The splita and splitax docs both say that they split a string into array, I think it could be fair to assume as a user that the argument would be converted to a string

Cool!

Yes, happy to take a PR if you want! 😎