metaeducation / rebol-issues


PARSE string with quote, comma, and tab delimiting #1042

Closed rebolbot closed 8 years ago

rebolbot commented 15 years ago

Submitted by: Sunanda

This works as expected:

    x: rejoin [{"a,"} tab "b"]
    == {"a,"^-b}
    parse/all x tab
    == ["a," "b"]    ;; block length 2 as expected

But here, parse effectively promotes the comma to [tab ","]

    x: rejoin [{"a",} tab "b"]
    == {"a",^-b}
    parse/all x tab
    == ["a" "," "b"]   ;; block length 3 !?

R2 does the same. The issue was found in R2 while debugging a live application that attempts to read a tab-delimited file. At the very least, the issue is a gotcha that needs documenting so we can develop robust import routines.

    x: rejoin [{"a",} tab "b"]
    parse/all x tab

CC - Data [ Version: alpha 66 Type: Bug Platform: All Category: Parse Reproduce: Always Fixed-in:none ]

rebolbot commented 15 years ago

Submitted by: BrianH

This is a bug, not a gotcha that needs documenting. Is this one more consideration for the PARSE rewrite, or a quick fix?

rebolbot commented 15 years ago

Submitted by: Carl

I don't understand what result you want. The delimiters are in conflict. The quotes on the string make it a single "atom". Then you have both comma and tab for delimiters in the data, but you only specify tab as the delimiter? If so, then the comma is just data, not a delimiter, so the result above is correct.

If you want the comma removed, specify it as a delimiter.

    parse/all str "^-," ; tab and comma

If you want a specific result that you're not seeing, please post it in the ticket.

rebolbot commented 15 years ago

Submitted by: BrianH

I think that "The quotes on the string make it a single atom." was the source of confusion. I guess it is a gotcha that needs documenting after all, particularly since fixing this would break the ability for the data to contain the delimiter in the quoted portion. We can start by marking this as not a bug.
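As an aid to the mental model, here is a minimal sketch in Python (not Rebol) of the "quoted run is a single atom" behaviour discussed above. It is only a hypothetical model chosen to reproduce the outputs quoted in this ticket; it is not PARSE's actual implementation, and the function name `split_like_simple_parse` is made up for illustration.

```python
def split_like_simple_parse(text, delimiters):
    """Hypothetical model of the quote handling reported in this ticket.

    Assumption: a quoted run is copied as one atom, and whatever follows
    it up to the next delimiter becomes a separate atom.
    """
    fields = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] == '"':
            # Quoted run: copy everything up to the closing quote.
            end = text.find('"', i + 1)
            if end == -1:
                end = n
            fields.append(text[i + 1:end])
            i = end + 1
        else:
            # Unquoted run: copy everything up to the next delimiter.
            end = i
            while end < n and text[end] not in delimiters:
                end += 1
            fields.append(text[i:end])
            i = end
        # Consume one delimiter if the atom is immediately followed by one.
        if i < n and text[i] in delimiters:
            i += 1
    return fields


print(split_like_simple_parse('"a,"\tb', "\t"))   # ['a,', 'b']
print(split_like_simple_parse('"a",\tb', "\t"))   # ['a', ',', 'b']
```

Under this model the comma in the second input is plain data that happens to sit between the closing quote and the tab, which is why it comes out as a field of its own.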

rebolbot commented 15 years ago

Submitted by: Sunanda

What I wanted (and the application needed) was for parse to break the input string at the tab characters, regardless of any other special characters such as quotes or commas. I know my application's input cannot consist of strings with embedded tabs. And I did not want parse to use its initiative.
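For comparison, the literal behaviour described here (split on tabs only, with no special treatment of quotes or commas) can be sketched in Python as below. This is an illustration of the desired result, not Rebol code, and the helper name `split_on_tabs` is hypothetical.

```python
def split_on_tabs(line):
    # Break only at tab characters; quotes and commas stay in the fields as data.
    return line.split("\t")


print(split_on_tabs('"a",\tb'))   # ['"a",', 'b']
```

Note that the quotes remain part of the first field, so stripping them would be a separate step.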

But (as Brian suggests) that would conflict with my wish in other applications where I'd expect parse to intelligently handle embedded tabs and/or commas in CSV files.

So (also as Brian has suggested offline) the real issues are:

 1. unclear mental model of parse's built-in logic when it encounters embedded delimited strings
 2. expectation that parse can handle all CSV files, when we really need a snazzy mezz like decode-csv to handle all
    the messiness and the RFC 4180 specification.

The gotchas are:

 1. assuming parse does not have special handling for quotes
 2. assuming parse unaided can handle all possible CSV files.

rebolbot commented 15 years ago

Submitted by: BrianH

Well,

1. Simple PARSE's handling of quotes is mostly* consistent with RFC4180, and useful.
2. Handling all possible CSV files is unlikely for simple PARSE, since the differences are contradictory.

The rest sounds like a job for a DECODE-CSV mezzanine. Let's declare this a feature.

* Mostly:

According to R3:

>> parse {"hello""world^/",a} ","
== ["hello" "world^/" "a"]
>> length? parse {"hello""world^/",a} ","
== 3

According to http://tools.ietf.org/html/rfc4180 :

>> parse {"hello""world^/",a} ","
== [{hello"world^/} "a"]
>> length? parse {"hello""world^/",a} ","
== 2

Added a ticket for the above: #1079
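To cross-check the RFC 4180 expectation outside Rebol, the same input can be fed to Python's standard `csv` module. Its default dialect treats a doubled quote inside a quoted field as one literal quote and keeps embedded line breaks, which is the quoting behaviour at issue here. This is a sketch for comparison only; the wrapper name `parse_csv_text` is hypothetical, and Python's `csv` module is not claimed to be a strict RFC 4180 implementation.

```python
import csv
import io


def parse_csv_text(text):
    # newline="" leaves line breaks untranslated so that a break inside a
    # quoted field is preserved as data.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = parse_csv_text('"hello""world\n",a')
print(rows)           # [['hello"world\n', 'a']]
print(len(rows[0]))   # 2
```

The result is a single record with two fields, matching the RFC 4180 reading above rather than the three-field result from R3.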

rebolbot commented 15 years ago

Submitted by: Carl

I agree. It is handy if PARSE can deal with simplistic CSV formats.

For the heavy-duty cases, create DECODE 'CSV data: an R3 codec, rather than a separate function. That way you can build the encoder at the same time, and have a cool combo. (And yes, it should be possible to write it in R code.)