thscharler / spreadsheet-ods

Apache License 2.0
29 stars 6 forks source link

Allow row repetitions #13

Closed HermannDppes closed 3 years ago

HermannDppes commented 4 years ago

As I'd like to use this to generate files with content validation for entire columns, I think I need the row repetition (unless it's suddenly become fancy to explicitly write 2 billion rows). So, I've started to work on this feature.

As appears to be common, this PR primarily serves to track the work I'm doing and get feedback on it. If you don't want me to do it this way / don't want me to do this at all, just holler.

HermannDppes commented 4 years ago

Before I fix the read and write capabilities I should probably get some sleep but I'm aware that the write function merely compiles but became a mess due to my meddling.

thscharler commented 4 years ago

hm, not sure I like the way you started.

nevertheless, what kind of validation do you need for this? I've been looking at validation only very cursorly yet. maybe there is a better way than adding another layer of indirection?

thscharler commented 4 years ago

You want to use "table:content-validation-name"?

thscharler commented 4 years ago

Maybe this would be feasable: There is a RowColHeader struct with some fields. This could be split into RowHeader and ColHeader and we store the repeat-count for rows in the row header.

I'm not sure how the public API for this could look like but maybe we don't need much.

This would at least remove the additional layer for cell storage.

And this would solve a problem I had when reading row-repeats. They blew up the memory usage, so I ignored them for now.

HermannDppes commented 4 years ago

You want to use "table:content-validation-name"?

Yep, that and table:content-validations. This feature is mostly orthogonal to row-repetitions, I think, but will be of much more use to me when I have those.

maybe there is a[nother] way than adding another layer of indirection?

Yes, of course. The reasons I split this in the additional layer are the following: a) I found it easier to do. b) It is closer to the actual structure of the file format. c) It is easier for me not to mess up when I have more types, i.e. data for distinct purposes have distinct types. d) It felt like the cleaner separation of concerns, splitting them into different types instead of doing everything at once. e) Keeping the row-associated data with the rows seems far more straightforward to me than splitting that horizontally across different data fields. (If it had been me, I probably would have stored the row headers with the rows in an approach similar to my current one from the beginning.)

As you seem to be opposed to this: What are your reasons for choosing the approach with the single BTreeMap? It has not been listed with my reasons for the change, as it is not an argument for any particular approach, but I did only start to code it once I convinced myself that it would be reasonably easy to change everything back and I am honestly looking forward to learning your perspective.

I'm not sure how the public API for this could look like but maybe we don't need much.

There should be nearly no additional API needed, except „repeat the current last row (n) times“, at least for the beginning. In the future, additional features would be nice of course, e.g. a de-duplication feature that allows to programmatically detect (empty) rows with identical metadata and join them to a single repeated row.

Maybe this would be feasable: There is a RowColHeader struct with some fields. This could be split into RowHeader and ColHeader and we store the repeat-count for rows in the row header.

It is definitely possible. A small remark though: As columns can be repeated we would not need to split the structures for that.

thscharler commented 3 years ago

This has been implemented differently now.