j3-fortran / fortran_proposals

Proposals for the Fortran Standard Committee

Treat tab characters in list-directed input as a separator like a comma #141

Open arjenmarkus opened 4 years ago

arjenmarkus commented 4 years ago

Currently, tab characters read via list-directed input are not treated specially, in contrast to commas (or semicolons, if the file being read was opened with DECIMAL='COMMA'). Tab characters are very commonly used to separate fields, however, for instance in CSV files. It would be nice if a tab character were treated in the same way as a comma/semicolon. More specifically: a single tab separates two fields (numerical, character, ...), and two consecutive tabs mark an empty position, so that the corresponding variable in the READ statement is left unchanged, just as with two consecutive commas. In fact, the treatment of a tab should be exactly the same as that of a comma (or semicolon).

This would introduce a backward incompatibility. To avoid that, the OPEN statement could be used to attach this meaning to the tab character, in the same way as the keyword DECIMAL changes the meaning of a comma. Something along the lines of:

OPEN( NEWUNIT=lun, FILE='input.csv', TAB='SEPARATOR' )

Alternatively, TAB='CHARACTER' would indicate that no special treatment of tab characters should be applied. This would be the default.
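For reference, the comma behaviour that the proposal wants tabs to mirror is the existing null-value rule of list-directed input. A minimal sketch, reading from an internal file so that it is self-contained (the program and variable names are only illustrative):

```fortran
program null_value_demo
    implicit none
    character(len=16) :: record
    integer :: x, y, z

    record = "1,,3"
    x = -1; y = -1; z = -1

    ! Two consecutive commas denote a null value: the corresponding
    ! list item (here y) keeps the value it had before the READ.
    read (record, *) x, y, z

    print *, x, y, z    ! y is still -1
end program null_value_demo
```

Under the proposal, a record with tabs in place of the commas would give the same result when TAB='SEPARATOR' is in effect.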

sblionel commented 4 years ago

I assume you know that tab isn't even in the Fortran character set at present. I think most implementations allow tab in list-directed input as equivalent to spaces, so this generally works (and indeed I use it in my own programs) with the exception of the omitted value. That seems like a very strange corner-case that could end up causing more confusion than help since, outside of CSV files, multiple tabs just appear as whitespace.

In all of the requests I've seen asking for language help with CSV files, I never ran into this. How common is it to have a CSV file with an omitted value?

If we were to do this at all, I would restrict it to an I/O control list specifier on a list-directed (or NAMELIST?) READ only.

everythingfunctional commented 4 years ago

@sblionel, missing data in CSV files is very common in data science/big data applications. Whether the language should be changed to accommodate edge cases in specific data formats for specific applications is a different question.

arjenmarkus commented 4 years ago

@sblionel, yes, I know tabs are not part of the character set ;). Yet they are ubiquitous. Files with missing values that I regularly have to deal with have, for instance, a column indicating that a measured concentration value is below the detection limit. In that case the field contains a "<". If the value is well above it, the field is simply empty, so exporting the data to a CSV file results in two consecutive commas or tabs. Perhaps my request that they be treated in this way is a bit overzealous, as @everythingfunctional points out. At least a guaranteed treatment of tabs, in whatever way, seems useful enough.

richardbleikamp commented 4 years ago

I agree with Steve that limiting this to an I/O control list specifier makes sense.

Perhaps a more general capability, to allow the user to specify multiple separator values, possibly even changing "," (or ";") to NOT be considered a separator (what about blanks???), would make more sense. A possible problem with this is accommodating different character KINDs, so this might be limited to input records that only contain default kind characters.

READ(..., SEPARATORS=", ;\t"), where "\t" stands for a real TAB character (of default kind); that character, not the two-character sequence "\t", is what the user would type. This only affects list-directed input of records containing solely default kind characters.

  1. if a blank is included in the SEPARATORS value, what do consecutive blanks do? Perhaps blanks should always be a separator?

  2. does a SEPARATORS character in a delimited (or non-delimited) character input value affect anything? Probably not, that is, it's not a separator if it occurs in a delimited character string in an input record.

  3. perhaps this capability should only add new separators, not affect blank, comma, semicolon, and possibly be limited to control characters (not a well defined thing in the standard), or non-alphanumeric, not .+='"()*&%$#@! etc??? (exclude every character in the Fortran char. set?), this would still allow horizontal and vertical tabs, BELL, etc.
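The idea of user-specified separators, including point 2 about delimited strings, can be emulated today by preprocessing the record. The following is only a sketch under stated assumptions: `normalize_separators` is a hypothetical helper (not part of any standard or library), the separator set must not contain a blank, and a quote or apostrophe is always taken as a string delimiter, which is simpler than the actual list-directed rules.

```fortran
module separators_sketch
    implicit none
contains
    ! Hypothetical helper: replace every character of SEPARATORS that
    ! occurs outside a delimited string by a comma, so that an ordinary
    ! list-directed READ sees it as a value separator.
    subroutine normalize_separators(record, separators)
        character(len=*), intent(inout) :: record
        character(len=*), intent(in)    :: separators
        character(len=1) :: delim
        integer :: i

        delim = " "                        ! blank means "not inside a string"
        do i = 1, len(record)
            if (delim /= " ") then
                ! Inside a delimited string: only look for the closing delimiter
                if (record(i:i) == delim) delim = " "
            else if (record(i:i) == '"' .or. record(i:i) == "'") then
                delim = record(i:i)        ! opening delimiter
            else if (index(separators, record(i:i)) > 0) then
                record(i:i) = ","
            end if
        end do
    end subroutine normalize_separators
end module separators_sketch

program separators_demo
    use separators_sketch
    implicit none
    character(len=1), parameter :: tab = achar(9)
    character(len=64) :: record
    character(len=16) :: name
    integer :: x, y, z

    ! A quoted value containing "|", then a tab, then 10, an omitted value, 30
    record = '"a|b"' // tab // '10||30'
    call normalize_separators(record, tab // "|")

    y = -1
    read (record, *) name, x, y, z
    print *, trim(name), x, y, z    ! name is a|b, y is still -1
end program separators_demo
```

Because the helper only ever adds commas, it corresponds to the variant in point 3 where new separators are added and blank, comma and semicolon keep their usual meaning.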


The user can do this themselves now (in F2018?), if we assume a list directed READ only reads ONE record from the input file (true for .csv files I think). Something like:

READ(unit, *,TAB="separator") x,y,z

could be written as

BLOCK
    CHARACTER(LEN=4096) :: newrec                  ! replace 4096 with a reasonable max line length
    CHARACTER(LEN=1), PARAMETER :: tab = ACHAR(9)  ! horizontal tab; Fortran has no "\t" escape
    INTEGER :: p
    READ (unit, FMT="(A)") newrec                  ! read one whole record as text
    p = SCAN(newrec, tab)
    DO WHILE (p > 0)                               ! replace every tab with a comma
        newrec(p:p) = ","
        p = SCAN(newrec, tab)
    END DO
    READ (newrec, *) x, y, z                       ! list-directed read from the internal file
END BLOCK

So, I think a quick straw vote is appropriate for meeting 221 or later, if Dan allows discussion (where "we" is J3): 1) should we support TAB="separator" in a READ stmt, OR 2) should we support SEPARATORS="xxx", OR 3) should we do nothing?

Comments?

sblionel commented 4 years ago

My suggestion for now is to not over-specify it. It is sufficient to provide a general description of the problem and use cases. A normal process would have each request written up as a separate J3 paper, but I think what would work here is a single document with links to each of the threads and let J3 (and WG5!) members review the whole of the discussions. This would also provide some record that could live on the J3 site - the paper could be amended to include results of any discussions.

Do keep in mind that, ultimately, it is WG5 that decides whether or not to include a feature, though a favorable recommendation from J3 goes a long way.

arjenmarkus commented 4 years ago

@richardbleikamp in the "tokenizer" modules in my Flibs project I explicitly consider delimiters and separators - two consecutive delimiters mean there is an empty string in between and separators work just as spaces.
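To make the distinction concrete, here is a small sketch of a tokenizer that treats the two classes differently. It is written for this discussion and is not the Flibs API; the routine `next_token` and its arguments are hypothetical.

```fortran
module tokenizer_sketch
    implicit none
contains
    ! Hypothetical routine: return the next token of LINE starting at POS.
    ! Runs of SEPARATORS collapse like blanks; each single DELIMITERS
    ! character ends a token, so two in a row yield an empty token.
    subroutine next_token(line, pos, delimiters, separators, token, done)
        character(len=*), intent(in)               :: line, delimiters, separators
        integer, intent(inout)                     :: pos
        character(len=:), allocatable, intent(out) :: token
        logical, intent(out)                       :: done
        integer :: start

        call skip_separators()
        done = pos > len(line)
        if (done) then
            token = ""
            return
        end if

        start = pos
        do while (pos <= len(line))
            if (index(delimiters // separators, line(pos:pos)) > 0) exit
            pos = pos + 1
        end do
        token = line(start:pos-1)       ! zero length between two delimiters

        call skip_separators()
        if (pos <= len(line)) then
            ! Consume exactly one delimiter after the token, if present
            if (index(delimiters, line(pos:pos)) > 0) pos = pos + 1
        end if
    contains
        subroutine skip_separators()
            do while (pos <= len(line))
                if (index(separators, line(pos:pos)) == 0) exit
                pos = pos + 1
            end do
        end subroutine skip_separators
    end subroutine next_token
end module tokenizer_sketch

program tokenizer_demo
    use tokenizer_sketch
    implicit none
    character(len=:), allocatable :: token
    integer :: pos
    logical :: done

    pos = 1
    do
        ! Comma is a delimiter, blank is a separator
        call next_token("a,,b  c", pos, ",", " ", token, done)
        if (done) exit
        print "(3a)", "[", token, "]"
    end do
end program tokenizer_demo
```

With the comma as delimiter and the blank as separator, the line "a,,b  c" yields the tokens a, an empty string, b and c. A trailing delimiter at the end of the line does not produce a final empty token in this sketch.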