Tighten trimming for ExpandUntil and ExpandBetween

muesli4 / table-layout

Layout data in grids and pretty tables. Provides a lot of tools to get the cell formatting right (positional alignment, alignment on specific characters and limiting of cell width)..

BSD 3-Clause "New" or "Revised" License

37 stars 11 forks source link

Tighten trimming for ExpandUntil and ExpandBetween #53

Open Xitian9 opened 1 year ago

Xitian9 commented 1 year ago

Fixes #35

muesli4 commented 1 year ago

Before I dive into this, would you mind giving a simple example (with code) that shows the behavior before and after?

Xitian9 commented 1 year ago

Behaviour before

Github messes up the spacing a little bit, but note that there are three spaces to the right of Bob and Sue, and two spaces to the right of the Sun Yat-Sen's names.

ghci> import Text.Layout.Table.Cell.WideString
ghci> putStrLn . tableString .
  columnHeaderTableS [ column (expandUntil 5) left noAlign noCutMark ] unicodeRoundS (titlesH ["Text"]) $
  [colsAllG top $ [map WideString ["Bob", "Sue", "孫德明", "孫帝象", "孫載之" ]]]
╭───────╮
│ Text  │
╞═══════╡
│ Bob   │
│ Sue   │
│ 孫德  │
│ 孫帝  │
│ 孫載  │
╰───────╯

Behaviour after

Note that there are now only two spaces to the right of Bob and Sue, and one space to the right of Sun Yat-Sen.

ghci> import Text.Layout.Table.Cell.WideString
ghci> putStrLn . tableString .
  columnHeaderTableS [ column (expandUntil 5) left noAlign noCutMark ] unicodeRoundS (titlesH ["Text"]) $
  [colsAllG top $ [map WideString ["Bob", "Sue", "孫德明", "孫帝象", "孫載之" ]]]
╭──────╮
│ Text │
╞══════╡
│ Bob  │
│ Sue  │
│ 孫德 │
│ 孫帝 │
│ 孫載 │
╰──────╯

Explanation

Since Sun Yat-Sen's names each consist of three wide characters they are of width 6, which is larger than our requested maximum of 5. The therefore asks code would request that each string be trimmed down to width 5. However, none of Sun Yat-Sen's names can be trimmed to 5 characters: the closest you can get is 4.

Previously, it would trim them to 4 and then add 1 space of padding to make it up to 5 characters. But since none of the trimmed items are really of length 5, we can get better spacing by not adding the extra padding. This is what the code does now: it recognises that we really are only a column of width 4, and renders it that way.

Xitian9 commented 1 year ago

Here's another example with cut marks

Before

ghci> import Text.Layout.Table.Cell.WideString
ghci> putStrLn . tableString .
  columnHeaderTableS [ column (expandUntil 4) left noAlign ellipsisCutMark ] unicodeRoundS (titlesH ["Text"]) $
  [colsAllG top $ [map WideString ["Bob", "Sue", "孫德明", "孫帝象", "孫載之" ]]]
╭──────╮
│ Text │
╞══════╡
│ Bob  │
│ Sue  │
│ 孫 … │
│ 孫 … │
│ 孫 … │
╰──────╯

After

ghci> import Text.Layout.Table.Cell.WideString
ghci> putStrLn . tableString .
  columnHeaderTableS [ column (expandUntil 4) left noAlign ellipsisCutMark ] unicodeRoundS (titlesH ["Text"]) $
  [colsAllG top $ [map WideString ["Bob", "Sue", "孫德明", "孫帝象", "孫載之" ]]]
╭─────╮
│ …xt │
╞═════╡
│ Bob │
│ Sue │
│ 孫… │
│ 孫… │
│ 孫… │
╰─────╯

Xitian9 commented 1 year ago

In general, I am not opposed to these changes. The problem I am having is with the logic for measuring that is intertwined with the logic for building. And to me that indicates that the problem is not fully understood or not properly represented. If it cannot be represented that way, I would like to know the reasoning.

That's completely reasonable, and I'm not sure I've found the optimal approach. I'm interested in your feedback and thoughts.

The reason I've gone with this approach is that calculating the padding needed is almost all of the work of actually building. To figure out whether any padding is needed, you need to try attempt to trim the requested amount, and then check to see whether you had to trim more than requested. At this point, you build by adding the extra padding. So the keeping track of the built cell without padding, as well as how much padding is needed, makes either simple:

To build with padding, just add the padding with spacesB (as in the cellViewToPadding function)
To build without padding, discard the padding, or use fmap to operate on the underlying builder.

For all the examples we have so far, checking whether padding is needed involves dropping one character (or unit, for ElidableList) at a time and seeing what width you've actually dropped. This is computationally intensive, and so I'd prefer not to duplicate if possible.

Xitian9 commented 1 year ago

Perhaps a word on how I view this conceptually.

The Cell typeclass is one that knows how to do 3 things:

Measure its own length
Measure the position of the first (or last) character matching a predicate
Trim itself

I have explicitly left out padding, because padding is not something that Cell knows how to do. The StringBuilder class is where all the padding takes place. buildCell is the bridge between the two that makes it look like Cell can pad, but really it leaves that work to StringBuilder.

For most instances of Cell trimming and padding are naturally orthogonal, but WideString and others break that paradigm. Since trimming decreases length monotonically, but is not surjective onto all numbers less than its length, some trimming requires padding to get the exact length requested. buildCellViewTight restores the orthogonality of these two operations: it will only trim, and if any padding is required it records that in CellView.

That said, I haven't been as principled about this philosophy as I could be: buildCellViewTight will add padding if it is explicitly requested with a positive number in CellView. However, if only trimming is requested it will not pad. Maybe it's worthwhile being more principled here.

muesli4 commented 1 year ago

The thing is, you do the scanning for multi-width characters twice anyway. I think we should treat efficiency as a different issue for now. If it is necessary later there is other stuff you can do (mostly caching in WideString and WideText, but I don't think it is that expensive that it will ever matter).

To me it seems the only thing that changed with the introduction of multi-width characters is that if we drop a character, several width units are potentially dropped at once, and not just one.

I have explicitly left out padding, because padding is not something that Cell knows how to do. The StringBuilder class is where all the padding takes place. buildCell is the bridge between the two that makes it look like Cell can pad, but really it leaves that work to StringBuilder.

The way I see it, neither Cell nor StringBuilder do the padding. Cell, with the new changes, just happens to implement applying the padding from CellView. But it actually happens here:

Figure out the width range of the column and create an abstract description named ColModInfo.
Potentially alter or use ColModInfo.
Interpret ColModInfo (trimming, padding, et cetera).

Now, I still do not see why we cannot make our existing code aware of the new dropping behavior. If it is not possible at all, the cause has to be identified and we need to figure out a new architecture.

Xitian9 commented 1 year ago

The thing is, you do the scanning for multi-width characters twice anyway. I think we should treat efficiency as a different issue for now. If it is necessary later there is other stuff you can do (mostly caching in WideString and WideText, but I don't think it is that expensive that it will ever matter).

Okay, that is a reasonable decision. I'lll go along with it.

To me it seems the only thing that changed with the introduction of multi-width characters is that if we drop a character, several width units are potentially dropped at once, and not just one.

Yes, that is my take on it too.

I have explicitly left out padding, because padding is not something that Cell knows how to do. The StringBuilder class is where all the padding takes place. buildCell is the bridge between the two that makes it look like Cell can pad, but really it leaves that work to StringBuilder.

The way I see it, neither Cell nor StringBuilder do the padding. Cell, with the new changes, just happens to implement applying the padding from CellView. But it actually happens here:
1. Figure out the width range of the column and create an abstract description named `ColModInfo`.

2. Potentially alter or use `ColModInfo`.

3. Interpret `ColModInfo` (trimming, padding, et cetera).
Now, I still do not see why we cannot make our existing code aware of the new dropping behavior. If it is not possible at all, the cause has to be identified and we need to figure out a new architecture.

By ‘our existing code’, you mean can we do this without introducing any new functionality into the typeclass? I'm sceptical that it's possible.

The existing code assumes that if you ask it to drop a certain width, it will be able to drop exactly that width. Namely, it satisfies the following invariant: visibleLength (CellView l r a) == visibleLength a + l + r

This is why it's able to do everything with a single call to visibleLength at the beginning, and then determine the end length using arithmetic. Any trimming/building will result in a StringBuilder, which does not have a way of measuring length. So once we've trimmed a Cell, we need a way of knowing what length we really ended up with, and how much of it was just padding so that our invariant holds.

There are two ways I can see doing this: my buildCellViewTight, which explicitly records how much space is needed, and your proposal for a visibleLengthAfterDrop. They both record the same information in different ways.

(Note: I'm being a bit sloppy here with length, but I think my point is clear. Let me know if it isn't.)

It sounds like your call is that we go with some variation of visibleLengthAfterDrop. I'm not a fan of the name, but here are some proposals:

visibleLengthWithoutPadding :: Cell a => Int -> Int -> a -> Int  -- Questions: Does it take negative numbers for dropping (like CellView) or positive numbers? Does it ignore the other one?
visibleLengthWithoutPadding :: Cell a => CellView a -> Int  -- Questions: Does it ignore requested padding?

My vote is for the second, and that it completely ignore requested padding.

Xitian9 commented 1 year ago

My vote is for the second, and that it completely ignore requested padding.

I'm like to change my opinion to be that it should not ignore requested padding. This will enable the default implementation to be the straightforward visibleLengthWithoutPadding (CellView a l r) = visibleLength a + l + r. The name should probably change though: perhapse visibleLengthTight?

muesli4 commented 1 year ago

I do not care too much about the name. Small details are easily changed. I'm more interested in how the implementation will change and if there is something coming about that we did not plan.

Xitian9 commented 1 year ago

I've attempted to redo this PR using the visibleLengthWithoutPadding approach. It's feasible, but the last step of this PR becomes significantly more complicated.

The advantage of buildCellViewTight :: StringBuilder b => CellView a -> CellView b is that it doesn't add the padding in, and just keeps it around. This is an advantage when we want to start adding cut marks. As it currently stands in master, cut marks added to wide strings are added after the padding, e.g. 孫 … instead of 孫…. Since adding cut marks happens at the StringBuilder level, we need a way to build the Cell without adding the extra padding… which is precisely what buildCellViewTight does.

As I see it, we need to introduce functions to the Cell typeclass with the following requirements:

Can build to StringBuilder, but has to keep track of any padding necessary so it can be added after cut marks: buildCellViewTight :: StringBuilder b => CellView a -> CellView b.
Can determine its minimal length after trimming (without padding), so columns can be kept tighter if necessary: visibleLengthWithoutPadding :: CellView a -> Int

The second can be naturally implemented in terms of the first. The first can be implemented in terms of the second, but it's a bit clunky. I've done a lazy job here.

visibleLengthWithoutPadding a =
    visibleLength a - totalAdjustment (buildCellViewTight a :: CellView String)
buildCellViewTight a@(CellView x l r) =
    CellView (buildCellView $ CellView x (l - extraPadding) r) extraPadding 0
  where
    extraPadding = visibleLength a - visibleLengthWithoutPadding a

I think we should stick with the implementation with buildCellViewTight. If you'd like, I'm happy to add a function (either a typeclass function or a standalone function) as defined above. Let me know your thoughts.

Xitian9 commented 1 year ago

Let me try to summarise this second padding problem in a different way. Let me know if you agree.

In the current paradigm, there are three things that you can do with an instance of Cell (among others).

Measure the length with visibleLength.
Drop a specified width from the left/right with dropLeft/Right/Both.
Build to a StringBuilder with buildCell.

These are kept separate conceptually. Note that there is no concept of padding inherent in Cell. Any padding is implemented using non-typeclass functions like pad, trimOrPad, or the column alignment functions.

The current architecture is able to maintain this separation because it assumes that you can drop any positive integral width from either side of the Cell without issue. The length to drop is calculated by saying length to drop = visibleLength x - desired length, and visibleLength (dropBoth l r x) = visibleLength x - l - r.

Wide characters complicate this because there exist lengths you can't drop. For example, you cannot drop length 1 from a string consisting of a single width-2 character. In order to maintain the current architecture, we need to fake it by dropping more width than necessary, and then ‘topping up’ by adding extra padding to get to the desired width. By adding this extra padding, we can maintain the above invariants.

But this complicates the architecture, because it requires some sort of padding to be inherent in the definition of Cell, something which was not present before. The solution in master requires this to be handled within the typeclass definitions by explicitly adding padding in buildCellView.

However, this causes a problem. It means padding happens in a new place in the code, and this doesn't interface well with existing architecture. For example, when we call a function like trim, it will trim to the desired length, build to a StringBuilder, and add a cut mark. But since padding was added at the build stage, as necessitated by the current architecture, the cut mark gets added after padding, which is not the order we want.

There are a few ways I can see out of this problem:

Don't add padding at the buildCellView stage, but instead keep track of how much padding would be needed to get the desired width. Existing functions can be converted to this approach fairly easily by using the Monad instance of CellView. This is the approach taken by buildCellViewTight.
Make the CellView typeclass aware of CutMarks, and integrate higher-level functions like trim as typeclass methods.
For functions like dropLeft, instead of taking an argument specifying how much to drop, take an argument specifying the desired final length. In this case, dropping functions would need to be aware of length, so this would weaken the separation between dropLeft and visibleLength.

I continue to think that option 1 is the best approach.

muesli4 commented 1 year ago

Thank you for the summary. I feel that I do not fully understand everything and first need to dive back into the code. I have some vacation soon, so I hope to get this review done at around Christmas.

muesli4 commented 1 year ago

I found this issue to be complex so I am trying to structure my response into sections.

General

There are a few ways I can see out of this problem:

1. Don't add padding at the `buildCellView` stage, but instead keep track of how much padding would be needed to get the desired width. Existing functions can be converted to this approach fairly easily by using the `Monad` instance of `CellView`. This is the approach taken by `buildCellViewTight`.

2. Make the `CellView` typeclass aware of `CutMark`s, and integrate higher-level functions like `trim` as typeclass methods.

3. For functions like `dropLeft`, instead of taking an argument specifying how much to drop, take an argument specifying the desired final length. In this case, dropping functions would need to be aware of length, so this would weaken the separation between `dropLeft` and `visibleLength`.

I continue to think that option 1 is the best approach.

I think neither 2. or 3. can handle the actual complexity. In general, I agree with your assessment, that adding padding or trimming in buildCell is a bad idea. We should integrate the minimum amount of logic necessary into the Cell typeclass. It is just about cells themselves.

Issue

In my opinion, the crux of this issue lies in ColumnModifier. If we can somehow integrate the column headers into the related functions, I agree in getting rid of it. The reason why I think this, is because we now need more than one iteration to find the actual width of a column. I.e. dropping more than one width unit may change the whole width to a smaller width (even below the minimum threshold, e.g., in ExpandBetween or FixedUntil).

Approach

I try to describe how I imagine it. It is very likely to overlap with what you already implemented or wrote. Thus, I don't want to claim this is something completely new or fully my idea.

Measurement phase
1. Measure initial width of columns (including headers).
2. Use the column modification functions to compute the adjustment for each cell (may be represented as CellView).
3. Optional: If, for an instance, visibleLength c == length c, we could even skip the remaining steps.
4. For each adjustment we can simply subtract the minimum of the column.
5. Ensure that it fits the LenSpec.
Construction phase: Apply the computed adjustments to the cells (i.e. padding, trimming, applying cut marks).

Details

I would like to keep using StringBuilder out of the measurement phase.
Cell should receive new functions dropLeftLengthUnits :: Cell c => c -> Int -> (Int, DropAction) and and equivalent dropRightLengthUnits :: Cell c => c -> Int -> (Int, DropAction). The first result is the minimum amount of width units that are necessary to be dropped to achieve the given width units. The second result is an instance-specific type that describes the action necessary to drop from the cell. This enables us to avoid the visible length calculation a second time when we want to build the cell.
- For WideString this could be simply an integer that specifies how many actual characters are dropped.
- For String this is the same as the visible length.
- Other types can use arbitrarily complex data types.
Functions that are used in interpreting column modifiers are now applied directly instead of interpreting ColumnModifier ("column modification functions"). However, they will now compute the adjustment for a cell.
We need include cut marks in the computations.
We need to replace ensureWidthOfCMI.

How to move forward

Please let me know whether you agree with my assessment of the situation or see any issues.
Move unrelated commits into a separate MR so that we can get them out of the way and into master:
1. ffb3456f4dccfcf2627d257286530c749aba57f4 Simplify and generalise helpers for buildCell
2. 70d15259cf51f1b0d377c2ed8d4c98239d84a7ca Clean up compiler warnings
Check what we can reuse of the functionality of the other commits.
1. I don't see any use case for buildCellViewTight. I.e. building is a step that should happen at the end.
2. Some of the work that applies to column modification functions may be reused in the construction phase.
Check how that might interfere with ElidableList or other types in general.

If we agree on this proposal, then we can talk about who does the work or where we could collaborate. Technically, I wouldn't mind working on it and I still have a few days of vacation.

muesli4 commented 1 year ago

I will probably start with minor refactorings:

Split column modification functions into measurement and building phase parts.
Prepare for removing ColumnModifier.

Xitian9 commented 1 year ago

This is a very detailed analysis and proposal. At first read-over it seems reasonable to me, but I'll need to think a bit more deeply (and probably try to implement bits of it) to know for certain. In the meantime, I'll factor out the unrelated commits and submit a separate PR.

Xitian9 commented 1 year ago

How should we proceed? Are you intending to implement this, or should we divide up the work in a reasonable way?

muesli4 commented 1 year ago

How should we proceed? Are you intending to implement this, or should we divide up the work in a reasonable way?

If you are willing to help, what would probably make the most sense for you is adding the new Cell type class functions dropLeftLengthUnits and dropRightLengthUnits, as described above, and implement all instances. Before you start please check if it makes sense to you and report back if you see any issues. If you feel more comfortable with something else, let me know.

What I did so far is introduce a type that tracks the length instead of building it instantly and adapt the code. This may or may not be the best representation but we can change that easily later on.

> trimOrPad left ellipsisCutMark 5 (WideString "孫德明")
CellMod {baseCellCM = WideString "\23403\24503\26126", leftAdjustmentCM = 0, rightAdjustmentCM = -2, leftCutMarkLenCM = 0, rightCutMarkLenCM = 1}
> trimOrPad left ellipsisCutMark 5 (WideString "Sue")
CellMod {baseCellCM = WideString "Sue", leftAdjustmentCM = 0, rightAdjustmentCM = 2, leftCutMarkLenCM = 0, rightCutMarkLenCM = 0}

The following parts are still missing.

The logic to adjust the width (i.e. to eat up any superfluous space): Find unnecessary padding of all CellMods in a column and then building. This may be done in one iteration with laziness.
The code it is embedded in needs to be changed a bit to get to the actual columns. I will have to look at it again.
Adapt the code to use the new type class functions.

Xitian9 commented 1 year ago

I'm looking into this now, but my brain is having a bit of trouble wrapping itself around this. Could you give a quick example of how you'd implement and use this?

Cell should receive new functions dropLeftLengthUnits :: Cell c => c -> Int -> (Int, DropAction) and and equivalent dropRightLengthUnits :: Cell c => c -> Int -> (Int, DropAction). The first result is the minimum amount of width units that are necessary to be dropped to achieve the given width units. The second result is an instance-specific type that describes the action necessary to drop from the cell. This enables us to avoid the visible length calculation a second time when we want to build the cell.

For WideString this could be simply an integer that specifies how many actual characters are dropped. For String this is the same as the visible length. Other types can use arbitrarily complex data types.

muesli4 commented 11 months ago

I'm looking into this now, but my brain is having a bit of trouble wrapping itself around this. Could you give a quick example of how you'd implement and use this?

Cell should receive new functions dropLeftLengthUnits :: Cell c => c -> Int -> (Int, DropAction) and and equivalent dropRightLengthUnits :: Cell c => c -> Int -> (Int, DropAction). The first result is the minimum amount of width units that are necessary to be dropped to achieve the given width units. The second result is an instance-specific type that describes the action necessary to drop from the cell. This enables us to avoid the visible length calculation a second time when we want to build the cell.

For WideString this could be simply an integer that specifies how many actual characters are dropped. For String this is the same as the visible length. Other types can use arbitrarily complex data types.

I hope the following makes sense.

instance (Cell String) where
  type DropAction String = Int
  dropLeftLengthUnits c n =
    -- Actual characters are visible characters.
    let n' = max 0 (visibleLength c - n) in (n', n')
  visibleLength = length

instance Cell c => Cell WideString where
  type DropAction WideString = Int
  dropLeftLengthUnits c n =
    -- Here you determine how many actual characters need to be dropped
    -- to get to the desired visible length, and how many visible characters
    -- can actually be dropped.

    -- Later on, the DropAction may be used to do the dropping without measuring again.

instance Cell c => Cell (Formatted c) where
  type DropAction (Formatted c) =
    -- type that describes how to drop in a tree