gowerc / diffdf

DataFrame Comparison Tool
https://gowerc.github.io/diffdf/
Other
42 stars 5 forks source link

Abstract report design #98

Open gowerc opened 2 months ago

gowerc commented 2 months ago

@kieranjmartin

In order to support the generation of html / md as well as allowing users to customise the output as they see fit I think we need to abstract out the report creation to some light weight template structure.

My initial thoughts are something like follows:

Report
    - Components: a list of ReportComponent's
    - title: string

ReportComponent:
    - id: string
    - order: number
    - content: list of tags

tags are then just html style content tags. I would keep it minimal so perhaps just: h1, h2, h3, h4, p, table

order is so that the components appear in the correct order in the report e.g. separates report ordering from the literal object / list order which would make it easier for users to append / prepend additional components.

id is so that users could extract content by name again without having to worry about order in which the list is constructed.

As an example one of our tests would thus produce something like:

ReportComponent(
    id = "rows_in_base_not_compare",
    order = 500,
    content = ComponentBody(
        tag_p("Some rows were found in base that do not appear in compare"),
        tag_table(rows_in_base_not_compare)
        tag_p("Only 10 rows shown of x")
    )    
)

If users want to modify a component they could then do something like:

report["rows_in_base_not_compare"]<- append(
    report["rows_in_base_not_compare"],
    ComponentBody(tag_p("some additional footnote"))
)

Or if they wanted to inject an entirely new section:

report <- append(
    report,
    ReportComponent(
        id = "header_info",
        order = 10,
        content = ComponentBody(
            tag_p("Some additional info that I want to print into the report"),
        )    
    )
)
gowerc commented 1 month ago

Ok put a bit more effort into trying to get the UI/UX to be as smooth as the possible so have tweaked a few bits compared to what I posted above. I also have an initial first draft of everything I'm currently sharing on the branch 98-abstract-report

So I've now modified it so that report and report_chunk (formally Report and ReportComponent) are now just essentially lists with classes applied. This should make indexing a lot easier. the order value still exists on report_chunk but is now an attribute instead. Apparently classes get dropped when using append so I introduced a helper function combine, the added benefit is combine can handle all combinations of report / report_chunk / tag_element upgrading the types where needed.

So yer some examples of what I've currently got:

x <- report(
    "SB" = report_chunk(
        tag_h1("Section B"),
        tag_table(iris),
        .order = 10
    ),
    report_chunk(
        tag_h1("Section C")
    ),
    "SA" = report_chunk(
        title = tag_h1("Section A"),
        table = tag_table(iris),
        .order = 2
    )
)
render(x, row_limit = 2, type = "ascii") |> cat()
Section A 

First 2 of 150 rows are shown in table below

  ===============================================================
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species 
  ---------------------------------------------------------------
       5.1           3.5          1.4           0.2      setosa  
       4.9           3.0          1.4           0.2      setosa  
  ---------------------------------------------------------------

Section B 

First 2 of 150 rows are shown in table below

  ===============================================================
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species 
  ---------------------------------------------------------------
       5.1           3.5          1.4           0.2      setosa  
       4.9           3.0          1.4           0.2      setosa  
  ---------------------------------------------------------------

Section C 

Key points here are:

From here we can then have some examples of show how to manipulate the report object:

x[["SB"]] <- combine(
    x[["SB"]],
    tag_p("Some wrap up text")
)

tag_h1("Some random title") |>
    combine(x) |>
    combine(tag_p("Some footer")) |>
    render(row_limit = 2) |>
    cat()
Some random title 

Section A 

First 2 of 150 rows are shown in table below

  ===============================================================
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species 
  ---------------------------------------------------------------
       5.1           3.5          1.4           0.2      setosa  
       4.9           3.0          1.4           0.2      setosa  
  ---------------------------------------------------------------

Section B 

First 2 of 150 rows are shown in table below

  ===============================================================
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species 
  ---------------------------------------------------------------
       5.1           3.5          1.4           0.2      setosa  
       4.9           3.0          1.4           0.2      setosa  
  ---------------------------------------------------------------

Some wrap up text 

Section C 

Some footer 

Key points

bms63 commented 1 month ago

omg this is awesome! very exciting!

Any chance the functions could be appended with diffdf_ so easy to find?

bms63 commented 1 month ago

is there a way to control how many is outputted out? Sometimes I have 14 obs and I want them all rather than just 10 obs. So I guess an _all_ option would be nice as well.

kieranjmartin commented 1 month ago

I like this approach. I guess I would see this as an extension of construct_issue, where right now we construct an object with a value and the message embedded as an attribute.

I would think we would want to replace this with construct_issue returning a list with two components, the value and the report object.

I do wonder if we want to consider a more sophisticated approach with R6 rather than just building it on top of S3. In particular I am thinking that we may want to validate these elements if we expect that a user might want to modify them.

One thing we might want to think about is the object diffdf returns. I think the reason we put the print message as an attribute is that that makes the actual comparison object very easy to interrogate, but I think we can just add some methods to grab the pieces that we want to modify/access.

I really like the idea of diffdf being able to accept new issue constructors provided they follow our standards.

bms63 commented 1 month ago

I like the R6 idea but maybe that is for diffdf2? :)

just seems like a big refactor, but maybe not!

gowerc commented 1 month ago

@bms63 interesting you should say that, @kieranjmartin and I were just debating this exact issue the other day. We are wanting to make a bunch of backwards incompatible changes to diffdf in the next version (though admitedly the vast majority are internal changes we will try to retain as much of the external API as possible). I'm not entirely sure how such a change should be released.

I think @kieranjmartin's current position is just to release via this diffdf marking it as version 2. This is in order to not have to maintain both a diffdf and a diffdf2 in parallel.

I am torn between the two. I think I prefer the idea of just updating diffdf as per Kieran's preference but I am slightly conscious of the disruption this could potentially cause (depending on how many changes we do make).

bms63 commented 1 month ago

If disruption is minimal to users, then I am all for it!

I was just thinking that we could follow the ggplot/ggplot2 path? Keep diffdf around and stable. I don't think there have been a lot of patches over the years so minimal upkeep going forward. diffdf2 can go wild and start fresh without having to try and worry about users code as this is brand new package that they will need to install.

kieranjmartin commented 1 month ago

Haha @bms63 so for me ggplot2 is exactly the case I want to avoid.

It has always annoyed me that ggplot2 has the 2, despite ggplot not really existing anymore; it was archived in 2008!

My feeling is that if users want the earlier diffdf, it will still exist, and realistically if we did make a diffdf2, diffdf would get archived as soon as a breaking change happened on CRAN.

But I appreciate decisions do vary, and I know Hadley Wickham did this with plyr->dplyr as well! I'm not sure what he would recommend these days!