Move to database (YAML?) for source data?

rubyreferences / rubychanges

Comprehensive changelog of Ruby Programming Language

https://rubyreferences.github.io/rubychanges/

Move to database (YAML?) for source data? #46

Open Phrogz opened 2 years ago

Phrogz commented 2 years ago

I'd love to be able to consume this fantastic information you've created and provide an alternative visualization for it. For that, it would be far easier to consume a database that has discrete change entries with fields like: version, category, title, class, method, summary, overview, reason, discussion, documentation, example_code, notes, and so on.

The contents of many of these fields could/would be Markdown, and the full Markdown or HTML as present today could be generated from them.

Pros

People (like me) could consume the core data programmatically more easily
I think it would be more clear how to add new features in a consistent manner.
Would allow slicing the data in different ways, e.g. creating a view that covers only the String class across all versions, or only the changes from 3.0 to 3.1.
(maybe not a pro?) The visual presentation would be forced to be consistent. For example, Notes: vs. Note:

Cons

A fair bit of scraping work (or manual conversion) would be needed to convert the existing data
Removes the ability for custom presentation for a specific item, if desired. For example, if there was only one notes value per entry, or one example_code section, but you wanted two separate notes labeled separately.
Hand-editing Markdown-in-YAML does not have nice Markdown syntax highlighting.

Phrogz commented 2 years ago

Since I want the information anyhow, I decided to try converting to a database myself. The work in progress can be seen here, for your consideration. I've just started populating a master database file and have not yet worked on scripts to produce the MD or HTML currently available, so it's definitely not a full replacement yet.

https://github.com/Phrogz/rubychanges/blob/dev/database-driven/_src/database.yaml

My rationale for some of the design:

I made the ~database file hierarchical so as not to repeat the version number or general category in every item. When I inject this data I plan to flatten the hierarchy, copying the version and section information into each entry and having them as a flat list.
The title of each entry should be concise and clear, so they can be understood out of context. I changed a few of your existing titles during manual conversion, e.g. appending the word "added" to titles where new methods were added.
The kind of each entry is one of: addition (new functionality or syntax that does not break old), change (old code cannot be used as it used to), promotion (making experimental features non-experimental) or removal (getting rid of old behavior). You might ignore this in your HTML, or might use it for icons categorizing the changes.
The highlight tag would be used to loft specific entries into a Highlights section. More granular ranking of "importance" may be desirable, e.g. to produce highlights-only, or highlights+generally useful, or highlights+useful+details few care about.
The summary is an explanation of the change beyond the title, but not as deep as the code sample or your reason.
Though I like your presentation of "Discussion" for features and bugs and pull requests, I think instead of the actual URL--which might change in the future--just the unique ID(s) should be present in an entry. Thus there is bug and feature and github-pull-request, which would all go under "Discussion", but whose links and labels would be generated based on which type it is.
Similarly for docs, I'm keeping just the URLs and planning on scraping the web pages to get the <title> for display purposes. Dumb? Maybe.
I added a class tag to each item to be able to search for just changes to specific classes. And maybe to help group changes. Some of these entries are sketchy, like using class: Method for changes to parameter calling.
As seen throughout, many of the fields (docs, feature, bug, class) can either have a single value or an array for when it suits.
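A minimal sketch of the flattening and link generation described above. The YAML layout (version, then section, then a list of entries), the placeholder entry, and the GitHub pull-request URL pattern are assumptions for illustration; the real database.yaml may be organized differently. `Array()` is used so a field can hold either a single value or an array.

```ruby
require "yaml"

# Hypothetical hierarchical layout: version -> section -> entries.
SOURCE = <<~'YAML'
  "2.4":
    Core classes:
      - title: "Numeric#finite? and #infinite? added"
        kind: addition
        class: Numeric
        feature: 12039
      - title: "Placeholder entry (IDs below are made up)"
        kind: change
        class: [Array, Enumerable]
        bug: [1, 2]
YAML

# Labels and links are generated from the ID type instead of being stored.
# The pull-request URL pattern is an assumption.
LINK_BUILDERS = {
  "feature" => ->(id) { ["Feature ##{id}", "https://bugs.ruby-lang.org/issues/#{id}"] },
  "bug" => ->(id) { ["Bug ##{id}", "https://bugs.ruby-lang.org/issues/#{id}"] },
  "github-pull-request" => ->(id) { ["GitHub PR ##{id}", "https://github.com/ruby/ruby/pull/#{id}"] }
}.freeze

# Copy the version and section down into every entry, producing a flat list.
def flatten_entries(db)
  db.flat_map do |version, sections|
    sections.flat_map do |section, entries|
      entries.map { |entry| entry.merge("version" => version.to_s, "section" => section) }
    end
  end
end

# Build [label, url] pairs for everything that belongs under "Discussion".
def discussion_links(entry)
  LINK_BUILDERS.flat_map do |field, builder|
    Array(entry[field]).map { |id| builder.call(id) }
  end
end

entries = flatten_entries(YAML.safe_load(SOURCE))
entries.each do |e|
  puts "#{e['version']} / #{e['section']}: #{e['title']} (#{Array(e['class']).join(', ')})"
  discussion_links(e).each { |label, url| puts "  #{label}: #{url}" }
end
```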

Phrogz commented 2 years ago

Open questions:

Where you have general preamble prose for a release, should that be part of the database of changes, or a separate file perhaps?
You have some nice hyperlinks between entries. I removed some of those during my conversion, but I'd like to reinstate them. How can I make unambiguous references to entries in the files that don't rely on the generated HTML? I'd hate to rely either on array indexing of entries, or to have to type id: "foo-bar" into each and every entry.
- Maybe only add id when it's needed for a reference?

zverok commented 2 years ago

Thanks for raising the topic! More formal DB was my original wish, and you summarized its pros & cons beautifully. From the perspective of the person who needs to, well, author it all, the show-stopper (why I didn't start with YAML—and I intended to!) is the convenience of authoring and the humanity of the result. It basically falls into:

Highlight of Markdown in any editor
No strict structural requirement: as you've noticed, sometimes I believe that I need two examples, or one note, or many notes; the logical grouping and ordering of changes has also been adjusted from year to year, depending on my general outlook on what's related and what's meaningful

That being said, both can be mitigated, and I actually have some ideas about how it all could work (but nobody asked before, and I never had enough time myself!). The virtues of having it in a more formal structure and being able to autorender some slices, reorder, etc. are obvious (and actually while producing "Evolution," I did some small automation: that's why I left "not important enough" features in the source file, just commented them with : so next year it would be easier to compare automatically what's already in the file and include only missing things).

So, what I'd do in the direction of allowing the DB-alike usage while preserving humanity and authoring convenience (and that was the plan all along, but I never had the resource to implement it):

Leave markdown files as a "master" source of the content
Make a parser for those files, just accepting several types of paragraphs: header; list item starting with known "{Field}:" (documentation, note, example, etc.); codeblock
Run all existing files through the parser, moving the files and the parser towards each other (e.g. more formalization in files once the parser is able to clearly report "line 150: unknown field Dcoumentation" + more flexibility in the parser once we are sure that some, IDK, nested lists are necessary)
Then reuse the parser to both produce the current form of Jekyll-ready markdown (instead of the current ad-hoc regexp-based processors... And once it is a better parser, the authoring might be made easier, so, for example, any {Foo#bar} in source files would be auto-replaced with links to Ruby docs)...
...and also to produce YAML, differently-organized md-files, good search indexes, etc.
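The parser outlined in these steps could start out roughly like the sketch below. It accepts only three kinds of lines (headers, list items starting with a known "**Field:**", and fenced code blocks) and reports anything else it cannot place. The field list, the `Entry` struct, and the method names are assumptions for illustration, not the project's actual tooling.

```ruby
# Fields the parser knows about; anything else is reported with its line number.
KNOWN_FIELDS = %w[Reason Discussion Documentation Code Note Notes].freeze
# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

Entry = Struct.new(:title, :fields, :code_blocks)

def parse_changelog(text)
  entries = []
  problems = []
  in_code = false
  code = nil

  text.each_line.with_index(1) do |raw, number|
    line = raw.chomp
    if line.lstrip.start_with?(FENCE)
      # A fence either opens a code block or closes the current one.
      if in_code
        entries.last&.code_blocks&.push(code.join("\n"))
      else
        code = []
      end
      in_code = !in_code
    elsif in_code
      code << line
    elsif (m = line.match(/\A[#]{2,6}\s+(.+)\z/))
      entries << Entry.new(m[1], [], [])
    elsif (m = line.match(/\A\*\s+\*\*([^:*]+):\*\*\s*(.*)\z/))
      name = m[1]
      if !KNOWN_FIELDS.include?(name)
        problems << "line #{number}: unknown field #{name}"
      elsif entries.empty?
        problems << "line #{number}: field #{name} outside of any entry"
      else
        # Pairs instead of a hash, so an entry may carry several notes.
        entries.last.fields << [name, m[2]]
      end
    end
  end
  [entries, problems]
end

SAMPLE = <<~MD
  #### `Numeric#finite?` and `#infinite?`

  * **Discussion:** [Feature #12039](https://bugs.ruby-lang.org/issues/12039)
  * **Dcoumentation:** misspelled on purpose
  * **Code:**
  #{FENCE}ruby
  1.finite?
  #{FENCE}
  * **Note:** first note
  * **Note:** second note
MD

parsed, problems = parse_changelog(SAMPLE)
puts parsed.first.title
puts parsed.first.fields.map(&:first).inspect
puts problems
```

Keeping fields as ordered pairs rather than a hash is one way to preserve the "two notes, or many notes" flexibility while still giving the parser something to validate.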

About the cross-references and ids: it is a hard question! Basically, currently I just do this (extracted from Kramdown header-to-id conversion) whenever I need to ensure ids are stable. Maybe in a more powerful/formal structure, the raw titles can be used, e.g. "Follow-up: {2.7: Comparable#clamp with Range}" with the parser smart enough to run it through the header-to-id transformation. The problem is, of course, that if in the future somebody edits the headers in old files, all links would die.

The parser can handle it by link-validating, actually, but link dying is something that I am already trying to avoid (if somebody linked to the middle of the changelog years ago, I want the link to be alive forever, that's why, for example, Ruby 2.4's changelog has an idiotic "Stdlib" section: I forgot to change the working title to the proper "Standard library changes" before publishing and noticed it only after a few months, and now I don't want to break people's links). It also can be somewhat mitigated by "renaming + assigning the old name as a secondary anchor," but :shrug:
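A rough sketch of how title-based references, link validation, and secondary anchors could fit together. `header_id` is a simplified approximation of a Kramdown-style header-to-id conversion, not a copy of the real one, and the `HEADERS` table (including the renamed 2.4 section with its old id) is hypothetical data for illustration.

```ruby
# Simplified approximation of a Kramdown-style id: drop leading non-letters,
# drop everything except letters, digits, spaces and dashes, then dasherize.
def header_id(title)
  title.gsub(/\A[^a-zA-Z]+/, "")
       .gsub(/[^a-zA-Z0-9 -]/, "")
       .tr(" ", "-")
       .downcase
end

# Hypothetical header index per version. :old_ids keeps previously published
# anchors alive after a rename ("secondary anchors").
HEADERS = {
  "2.7" => [{ title: "Comparable#clamp with Range" }],
  "2.4" => [{ title: "Standard library changes", old_ids: ["stdlib"] }]
}.freeze

# Every anchor that must keep working for a given version.
def anchors(version)
  HEADERS.fetch(version, []).flat_map do |header|
    [header_id(header[:title]), *header.fetch(:old_ids, [])]
  end
end

# Resolve "{2.7: Comparable#clamp with Range}" to [version, id],
# or nil when the target does not exist (i.e. the link is dead).
def resolve(reference)
  m = reference.match(/\A\{(\d+\.\d+):\s*(.+)\}\z/)
  return nil unless m

  id = header_id(m[2])
  anchors(m[1]).include?(id) ? [m[1], id] : nil
end

p resolve("{2.7: Comparable#clamp with Range}")
p resolve("{2.7: Some header that was renamed}")
p anchors("2.4")
```

With something like this, a build step could fail loudly on any reference that resolves to `nil`, which catches edited headers before they break links.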

PS: For some time, I investigated semi-formalized formats like ArchieML, but never found them compelling enough, so I came up with my own :shrug:

Phrogz commented 2 years ago

I'm glad you've thought along the same lines. I continued manual porting of 3.0 changes into the DB for testing, and created a quick-n-dirty first pass at a "distilled" overview of changes. Right now the different views are baked into the HTML by the script:

$ ruby distill.rb -h
Usage: ruby distill.rb [options]
    -h, --help                       Prints this help
        --releases                   Show a list of documented releases
    -f, --from 2.7                   Show only changes after this release (default: 2.7)
    -t, --to 3.1                     Show only changes up to and including this release (default: 3.1)
    -v, --verbose                    Show debugging output during run
    -b, --breaking-only              Show only changes that modify the way the language works, potentially affecting existing scripts
    -l, --language-only              Show only changes to the language (not specific classes/methods)
    -i, --important                  Show only the most important changes
    -r, --relevant                   Show only major/medium changes (ignore esoteric changes)
    -o, --output changes.html        Set the output filename (default: ruby-changes.html)

...but my plan is to generate a single uber document with Javascript filtering, and the ability to click on any change and see all the amazing information you've provided.
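For illustration, the filtering behind the options in that help text could look something like the sketch below. This is not the actual distill.rb; the entry shape, the importance ranking, and the rule that "breaking" means a kind of change or removal are assumptions. `Gem::Version` is used so that releases compare numerically rather than as strings.

```ruby
require "rubygems" # Gem::Version; normally loaded already

IMPORTANCE_RANK = { "low" => 0, "medium" => 1, "high" => 2 }.freeze
# Assumption: these kinds are the ones that can affect existing scripts.
BREAKING_KINDS = %w[change removal].freeze

# Keep entries after `from`, up to and including `to`, optionally limited
# to breaking changes and to a minimum importance level.
def distill(entries, from: "2.7", to: "3.1", breaking_only: false, min_importance: "low")
  lower = Gem::Version.new(from)
  upper = Gem::Version.new(to)
  entries.select do |entry|
    version = Gem::Version.new(entry[:version])
    next false unless version > lower && version <= upper
    next false if breaking_only && !BREAKING_KINDS.include?(entry[:kind])

    IMPORTANCE_RANK.fetch(entry[:importance]) >= IMPORTANCE_RANK.fetch(min_importance)
  end
end

# Placeholder entries, not real changelog data.
ENTRIES = [
  { version: "2.7", kind: "addition", importance: "high",   title: "placeholder A" },
  { version: "3.0", kind: "change",   importance: "high",   title: "placeholder B" },
  { version: "3.0", kind: "addition", importance: "low",    title: "placeholder C" },
  { version: "3.1", kind: "removal",  importance: "medium", title: "placeholder D" }
].freeze

p distill(ENTRIES).map { |e| e[:title] }
p distill(ENTRIES, breaking_only: true).map { |e| e[:title] }
p distill(ENTRIES, min_importance: "high").map { |e| e[:title] }
```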

After 3.0 conversion I stopped the manual YAML conversion and made a pass at scraping the Markdown procedurally. I want to be able to stand on top of your work in the future, and not have a static snapshot that took hours to create and never is updated again.

The scraping is almost done, but it has the problem that I'd like to add metadata to each entry (like the "importance" level I'm using in my DB to filter between high/medium/low) that I don't think should be presented in URLs. Any thoughts on how to include such metadata per change if you continue to use Markdown?

Phrogz commented 2 years ago

Forgot to add pictures of what it is producing so far, to help sell why I think this is so important. :) Imagine a simple filter UI at the top of the page, getting this information, hovering each item to see a tooltip with a summary going into a little more detail, or clicking on an item and seeing full details filling the screen.

zverok commented 2 years ago

Your results are looking awesome :heart:

I'd like to add metadata to each entry (like the "importance" level I'm using in my DB to filter between high/medium/low) that I don't think should be presented in URLs. Any thoughts on how to include such metadata per change if you continue to use Markdown?

I think we can just add new list item types to sources in markdown. And either ignore them when rendering the current site, or, well, not ignore but render them prettily; they are a welcome addition :)

Phrogz commented 2 years ago

Alright, I'll head down that path. Thanks :)

Phrogz commented 2 years ago

Making progress. I could not think of an elegant way to add the information, so thus far it's just an extra "field" in the Markdown that looks like this (see last line):

#### `Numeric#finite?` and `#infinite?`

* **Reason:** The methods were present in `Float` and `BigDecimal`, but not in other numeric classes, which made it harder to write code uniformly processing numbers which may be integer/float/infinite.
* **Discussion:** [Feature #12039](https://bugs.ruby-lang.org/issues/12039)
* **Documentation:** [Numeric#infinite?](https://ruby-doc.org/core-2.4.0/Numeric.html#method-i-infinite-3F), [Numeric#finite?](https://ruby-doc.org/core-2.4.0/Numeric.html#method-i-finite-3F)
* **Code:**
   … removed here to stop fighting GitHub formatting …
* **Note:** Notice that `infinite?` returns `nil`/`-1`/`1` (always `nil` for integers), not `true`/`false` as most of other predicate methods. While unusual, it is convenient for checking both for infinity and its sign (+Infinity/-Infinity), and can be treated effectively as `true`/`false` in boolean context.
* **Metadata:** `{kind:addition, importance:medium, scope:Numeric}`

Feel free to suggest alternatives.
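For reference, scraping that Metadata line back out could be as small as the sketch below. The regular expression and the `parse_metadata` name are assumptions matching only the exact shape shown in the example above.

```ruby
# Matches a list item like: * **Metadata:** `{kind:addition, importance:medium}`
METADATA_LINE = /\A\*\s+\*\*Metadata:\*\*\s+`\{(.*)\}`\s*\z/

# Returns a hash of symbol keys to string values, or nil for any other line.
def parse_metadata(line)
  m = METADATA_LINE.match(line.chomp)
  return nil unless m

  m[1].split(",").to_h do |pair|
    key, value = pair.split(":", 2).map(&:strip)
    [key.to_sym, value]
  end
end

line = "* **Metadata:** `{kind:addition, importance:medium, scope:Numeric}`"
p parse_metadata(line)
p parse_metadata("* **Note:** not a metadata line")
```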

Progress update: I've annotated 2.4 and half of 2.5, and am now scraping that information and emitting a single HTML file with all changes and details—weighs in around 260k—with JS that allows live filtering of the change summaries (see top of screenshot below). Still TODO:

Page shows full-screen details for a specific change.
Finish annotating all the rest of the changes.
Make a PR where you get to object to my categorization of specific changes.

zverok commented 2 years ago

TBH, I'd prefer not to introduce additional nested "pseudo-language" into the structure. What I strived for is the balance between formality and readability/writeability, including self-evidence of the format. I would've gone with this:

Either adjust headers #### Numeric#finite? and #infinite? (addition, medium)
Or add extra separate fields: * **Kind:** Addition\n* **Importance:** Medium

...so that even the unprocessed source would be readable as markdown & HTML. The first option is probably enough: the "tags" are short, obvious, and non-conflicting; the second one is a bit easier to auto-parse. In either option, the parser can swear if it meets some unrecognized value.

As for scope, I'd try to auto-guess that by header and/or links to docs. It would be a constant small irritation to say "Title: Numeric#foo?; docs: Numeric#foo?, scope: Numeric (can't you guess it already!)". It is all somewhat informal, but I tried to keep (some) consistency. I believe a small set of heuristics + clear diagnostics "ugh, the parser can't guess the scope, can you rephrase?" + maybe a fallback for optional * **Scope:** Numeric for complicated/less formal cases should be handleable.
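A sketch of the first option (tags in the header) combined with scope guessing. The allowed kind and importance values, the single heuristic (a leading constant name followed by `#` or `.`), and the `explicit_scope` fallback standing in for an optional Scope field are all assumptions for illustration.

```ruby
KINDS = %w[addition change promotion removal].freeze
IMPORTANCE = %w[high medium low].freeze

# Guess the scope from the title; returns nil when the heuristic gives up,
# so the caller can print a "can't guess the scope" diagnostic.
def guess_scope(title, explicit_scope: nil)
  return explicit_scope if explicit_scope

  m = title.delete("`").match(/\A([A-Z]\w*(?:::[A-Z]\w*)*)[#.]/)
  m && m[1]
end

# Split "Title (addition, medium)" into title, kind, importance and scope.
# Raises on any tag it does not recognize.
def parse_header(header, explicit_scope: nil)
  m = header.match(/\A(?<title>.*?)\s*(?:\((?<tags>[^()]*)\))?\z/)
  result = { title: m[:title], kind: nil, importance: nil }

  m[:tags].to_s.split(",").map { |tag| tag.strip.downcase }.each do |tag|
    if KINDS.include?(tag)
      result[:kind] = tag
    elsif IMPORTANCE.include?(tag)
      result[:importance] = tag
    else
      raise ArgumentError, "unrecognized tag #{tag.inspect} in header #{header.inspect}"
    end
  end

  result.merge(scope: guess_scope(result[:title], explicit_scope: explicit_scope))
end

p parse_header("`Numeric#finite?` and `#infinite?` (addition, medium)")
p parse_header("Keyword arguments (change, high)")
p parse_header("Keyword arguments (change, high)", explicit_scope: "Method")
```

One consequence of this shape is that any trailing parenthetical in a header is treated as tags, so a title that legitimately ends in parentheses would need rephrasing or a stricter rule.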

kowal commented 12 months ago

hi @zverok @Phrogz while I have known this project for a longer time, I only found this discussion today :) I've been running a very similar project for the last 5 years. My ruby-changelog website was meant to be a very compact list of the most important changes with some code examples.

I did start with JSON file(s) as a source of truth for rendering .md files. The schema of the main data source looks like this:

{
  "ruby_versions": [
    {
      "version": "3.2",
      "version_info": "3.2.0 (Dec 2022) - 3.2.2 (March 2023)",
      "state": "Supported",
      "eol": "2026-03-31",
      "minors": [
        { "version": "3.2.2", "release_date": "2023-03-30", "end_date": "" },
        { "version": "3.2.1", "release_date": "2023-02-08", "end_date": "2023-03-30" }
      ],
      "implementations": [
        {
          "name": "MRI 3.2.2",
          "url": "https://www.ruby-lang.org/en/news/2023/03/30/ruby-3-2-2-released/"
        }
      ],
      "changes": [
        {
          "type": "internals",
          "tags": ["performance"],
          "experimental": false,
          "summary": "WASI based WebAssembly support",
          "links": {
            "news": "https://itnext.io/final-report-webassembly-wasi-support-in-ruby-4aface7d90c9"
          }
        },
        {
          "type": "internals",
          "tags": ["performance"],
          "experimental": false,
          "summary": "Production-ready YJIT"
        }
      ]
    }
  ]
}

This file (ruby_versions.json) is updated manually by me before each release. It also can be improved for sure :)
Another one (ruby_cve.json) is updated by a rake task which parses the official Ruby Releases page.
Those JSONs are used to generate markdown files for mkdocs (I know :), thinking about switching to jekyll)
The nice part about keeping this in JSONs is that I can also generate other views, like this newly added timeline.
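As a small illustration of rendering from such a file, the sketch below turns a trimmed copy of the schema above into Markdown. The output layout and the `render_version` name are assumptions; the real rake tasks and templates are not shown in this thread.

```ruby
require "json"

# Trimmed copy of the schema shown above.
DATA = JSON.parse(<<~'JSON')
  {
    "ruby_versions": [
      {
        "version": "3.2",
        "state": "Supported",
        "eol": "2026-03-31",
        "changes": [
          {
            "type": "internals",
            "tags": ["performance"],
            "experimental": false,
            "summary": "WASI based WebAssembly support",
            "links": {
              "news": "https://itnext.io/final-report-webassembly-wasi-support-in-ruby-4aface7d90c9"
            }
          },
          {
            "type": "internals",
            "tags": ["performance"],
            "experimental": false,
            "summary": "Production-ready YJIT"
          }
        ]
      }
    ]
  }
JSON

# Render one version as Markdown, grouping its changes by "type".
def render_version(version)
  lines = ["## Ruby #{version['version']}", "", "#{version['state']}, EOL #{version['eol']}", ""]
  version["changes"].group_by { |change| change["type"] }.each do |type, changes|
    lines << "### #{type.capitalize}" << ""
    changes.each do |change|
      item = "- #{change['summary']}"
      item += " (experimental)" if change["experimental"]
      news = change.dig("links", "news")
      item += " ([news](#{news}))" if news
      lines << item
    end
    lines << ""
  end
  lines.join("\n")
end

markdown = DATA["ruby_versions"].map { |version| render_version(version) }.join("\n")
puts markdown
```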

While my project doesn't really aspire to be as comprehensive as rubyreferences, I think the data schema which represents language changes is something that we could share in both projects.

I wonder what your progress is in defining this language-changes schema?