Proposal: introduce Decoder DisallowMoreRecords option

velart commented 1 year ago

Hi @jszwec!

Thank you for the wonderful library!

Proposal preambula:

Client of library have historically generated CSV files. Client need to process these CSV files, but do not need all information from them. Client only needs information from first N out of M (original) number of columns.

Possible solution: Rewrite headers in CSV file, keeping only first N out of M number of columns. If CSV file is w/o headers, then use custom headers with first N number of columns.

Changes in code: Add DisallowMoreRecords (or AllowMoreRecords) option to decoder. Then rewrite following function:

func (d *Decoder) decodeStruct(v reflect.Value) (err error) {
    d.record, err = d.r.Read()
    if err != nil {
        return err
    }

    if len(d.record) != len(d.header) {
        return ErrFieldCount
    }

    return d.unmarshal(d.record, v)
}

Like this:

func (d *Decoder) decodeStruct(v reflect.Value) (err error) {
    d.record, err = d.r.Read()
    if err != nil {
        return err
    }

    if len(d.record) != len(d.header) {
        if d.DisallowMoreRecords || len(d.record) < len(d.header) {
            return ErrFieldCount
        }

        d.record = d.record[:len(d.header)]
    }

    return d.unmarshal(d.record, v)
}

What do you think?

P.S. If proposal is approved, I may do the PR.

jszwec commented 1 year ago

This check exists to prevent a situation such as this one:

col1,col2,col3
1,2,3
1,2
3

It's hard to reliably map these to struct fields. What kind of guarantee I have that in the last record (3) is actually col1?

Adding your change is risky because of the above and other situations such as when the header changes. I believe that the caller should use lower level csv.Reader in ambiguous situations like this

If you dont care about extra columns, that's fine, you dont need to define them in your struct. However, your csv file must be consistent

velart commented 1 year ago

@jszwec ,

This is more about situation like this:

col1,col2,col3
1,2,3,4
1,2,3,5,6
1,2,3,

Code will still return ErrFieldCount for case you mentioned. Although I can agree that CSV file should be consistent.

To be honest in my case there is sometimes some garbage written after last column value. But values before that are still precious and can not be ignored. In the end I agree that my case is too specific and I am trying to cure symptoms here. So I am ok with proposal being declined.

jszwec / csvutil

Proposal: introduce Decoder DisallowMoreRecords option #57