This was mentioned in #37.
I do not believe this can be implemented in a reliable way. There are other edge cases, not just with extra fields but also with missing fields. Please refer to my comment in #37 for further context.
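For example, with a three-column header, both problems look like this (an illustrative snippet):

```
a,b,c
1,2,3,4
1,2
```

The second row has an extra field and the third is missing one; neither maps cleanly onto a struct.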
Thank you for your response.
I have reviewed the comments you pointed to. In real-world scenarios with CSV files, rows can have a varying number of columns, so there are times when we want to ignore the parts that are not tagged. An option to ignore those would be very much appreciated.
Additionally, in my opinion, if there are fields missing compared to the header, it might be acceptable for that to result in an error. It would help to ensure consistency in the data, while still allowing for flexibility where needed.
Thank you once again for your attention to this matter. I look forward to any updates or additional insights you may have.
Best regards,
In that case, I am a little confused about what you are asking for.
For:
```go
type Record struct {
	Name string `csv:"name"`
}
```

```
name,items
alice,apple
bob,banana,orange
```
The last row is wrong not because `Items` is not part of the struct, but because there is no third column in the header for `orange`, and that means this data needs to be handled in a specific way. We can't automatically decide that `banana` is an item, or that `orange` is.
csvutil is not supposed to handle all use cases. For as long as we can guarantee correctness I am open to ideas.
Thank you once again for engaging with this discussion.
I'd like to share a personal thought regarding the handling of rows where the number of columns doesn't align. In cases like that, it might be beneficial to have an option that interprets the CSV as if it were written with trailing commas to align the columns, such as in the example below:
```
name,items,,
alice,apple,,
bob,banana,orange
```
An option that treats such rows in a consistent manner, interpreting the padded fields as simply empty, would make this kind of data workable.
Of course, this is just my personal perspective.
I am not convinced. This feels like a parser's responsibility, and csvutil is not a parser. Since csvutil is based on an interface, you can write a wrapper around csv.Reader to fill in the gaps so that csvutil can understand it.
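For reference, the interface in question is csvutil's `Reader`; the wrapper in the next comment implements exactly this method:

```go
// Reader is the interface that csvutil.NewDecoder accepts; any type
// providing this Read method can feed records to the Decoder.
type Reader interface {
	Read() ([]string, error)
}
```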
Understood. While I would be happy if the library could handle this, I'm thinking of achieving it with the following code as a temporary measure. From here, I leave it up to the author.
```go
package main

import (
	"bytes"
	"encoding/csv"
	"io"
	"log"

	"github.com/jszwec/csvutil"
)

type Record struct {
	Name string `csv:"name"`
}

// NormalizeAlignCSV buffers an entire CSV and pads every record with
// empty strings so that all records share the maximum column count.
type NormalizeAlignCSV struct {
	buf     [][]string
	pointer int
}

// Read returns the buffered records one by one, satisfying the
// Reader interface that csvutil.NewDecoder expects.
func (n *NormalizeAlignCSV) Read() ([]string, error) {
	if n.pointer >= len(n.buf) {
		return nil, io.EOF
	}
	v := n.buf[n.pointer]
	n.pointer++
	return v, nil
}

func NewNormalizeAlignCSV(buffer io.Reader) *NormalizeAlignCSV {
	csvReader := csv.NewReader(buffer)
	csvReader.LazyQuotes = true
	csvReader.FieldsPerRecord = -1 // allow rows with varying column counts
	csvReader.ReuseRecord = true   // note: ReadAll always allocates fresh records, so this has no effect here
	all, err := csvReader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	// Find the widest record, then pad the shorter ones to match.
	max := 0
	for _, rows := range all {
		if max < len(rows) {
			max = len(rows)
		}
	}
	for i, rows := range all {
		all[i] = append(rows, make([]string, max-len(rows))...)
	}
	return &NormalizeAlignCSV{
		buf:     all,
		pointer: 0,
	}
}

func main() {
	bufferString := bytes.NewBufferString(`name,items
alice,apple
bob,banana,orange`)
	dec, err := csvutil.NewDecoder(NewNormalizeAlignCSV(bufferString))
	if err != nil {
		log.Fatal(err)
	}
	for {
		var record Record
		if err := dec.Decode(&record); err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		log.Printf("%+v", record)
	}
}
```
Your particular example is impossible to achieve in csvutil without determining `max` ahead of time. csvutil is a streaming library, so we cannot call ReadAll; buffering the whole input like that is also very inefficient.
I would approach it like this:
Go playground: https://go.dev/play/p/9rsuyaq8Yf-
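In outline (a minimal sketch of the same idea; the playground has the authoritative version, and `padReader` is just an illustrative name): the wrapper takes the header's width from the first Read, then pads shorter records and truncates longer ones to that width as they stream through.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"log"

	"github.com/jszwec/csvutil"
)

// padReader is an illustrative wrapper around csv.Reader that aligns
// every record to the header's width, so csvutil never sees a
// mismatched column count.
type padReader struct {
	r *csv.Reader
	n int // header width; 0 until the first Read
}

func (p *padReader) Read() ([]string, error) {
	rec, err := p.r.Read()
	if err != nil {
		return nil, err
	}
	if p.n == 0 {
		p.n = len(rec) // the first record is the header
	}
	for len(rec) < p.n {
		rec = append(rec, "") // missing trailing fields become empty
	}
	return rec[:p.n], nil // columns beyond the header are dropped
}

type Record struct {
	Name string `csv:"name"`
}

func main() {
	in := bytes.NewBufferString("name,items\nalice,apple\nbob,banana,orange\n")
	cr := csv.NewReader(in)
	cr.FieldsPerRecord = -1 // accept rows of varying width

	dec, err := csvutil.NewDecoder(&padReader{r: cr})
	if err != nil {
		log.Fatal(err)
	}
	for {
		var rec Record
		if err := dec.Decode(&rec); err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%+v\n", rec)
	}
}
```

Because the wrapper only remembers the header width, nothing is buffered and the whole input still streams.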
That's a great idea! Adjusting the slice to match the header rather than the maximum column count is something I overlooked, and it can be handled as a stream.
@yuki2006 I gave it another thought and decided to add it as a flag. The code is already on the master branch and I will create a new release soon.
@jszwec Thank you! I'm glad you implemented the feature for me.
Description
While using the csvutil library to decode CSV data, I've come across an issue when a CSV row contains more fields than are defined in the corresponding struct, even with csvReader.FieldsPerRecord = -1. The library currently appears to require that all rows match the number of fields defined in the header, even if some of those fields are not represented in the struct.
Example
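Given a struct that maps only some of the columns, and input where one row carries an extra field (the same snippet quoted in the discussion above):

```go
type Record struct {
	Name string `csv:"name"`
}
```

```
name,items
alice,apple
bob,banana,orange
```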
Expected Behavior
I expect the library to ignore extra fields in the CSV data that are not represented in the struct, without resulting in an error.
Actual Behavior
An error is encountered when trying to decode the row with extra fields (`bob,banana,orange`).
Potential Solution
A possible enhancement would be to allow the library to ignore unmapped fields in the CSV data, making the handling of inconsistent CSV data more robust.
Additional Context
This behavior can pose challenges when working with real-world CSV data, where some rows may have inconsistent numbers of fields. Being able to gracefully handle such inconsistencies would be a valuable feature.
Thank you for this convenient library. Please let me know if further details or clarifications are required. Your attention to this matter is greatly appreciated!