Generate OCaml modules with the static data in `ood`

tmattio commented 3 years ago

This PR implements a new binary, ood-gen, that generates OCaml modules for the data in data/.

As mentioned in the issues, this has the following benefits:

- Remove complexity from users by dealing with parsing and processing in ood.
- Ensure the data is consistent between the "QA" project (ood-preview) and the production stack.
- Catch data errors at compile time.
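
To illustrate the compile-time point, here is a minimal sketch of what a generated module could look like. The module name, type, fields, and values are all hypothetical and are not taken from the PR; the real output of ood-gen may be organised differently.

```ocaml
(* Hypothetical shape of a generated module. Names and data are
   illustrative assumptions, not the actual ood-gen output. *)
module Tutorial = struct
  type difficulty = Beginner | Intermediate | Advanced

  type t = {
    title : string;
    slug : string;
    difficulty : difficulty;
    body_md : string;  (* raw markdown, kept as a plain string *)
  }

  (* Every entry is a plain OCaml value, so a misspelt constructor or a
     missing field is rejected by the compiler, not found at runtime. *)
  let all = [
    { title = "First steps"; slug = "first-steps"; difficulty = Beginner;
      body_md = "# First steps\n" };
    { title = "Modules"; slug = "modules"; difficulty = Intermediate;
      body_md = "# Modules\n" };
  ]

  (* Look an entry up by its slug. *)
  let find_by_slug slug = List.find_opt (fun t -> t.slug = slug) all
end
```

Consumers would then call something like `Tutorial.find_by_slug "modules"` with no YAML or markdown parsing on their side.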

From what I can tell, and after discussing with @patricoferris, the only drawback of this approach is the size of the generated modules. We're at around 9 MB, including the media, which accounts for ~7.5 MB. I think 1.5 MB for all of the data is acceptable. I'm slightly uncomfortable with the 7.5 MB of images, but I feel this is also acceptable at the moment, and we can aim to move them to a CDN in the medium term.

This is mostly ready for review (it breaks the Netlify setup, which I'll fix soon, but the rest of the PR is ready).

Note that this changes the types in ood, so I'll open a PR to fix the build on v3.ocaml.org before merging this.

tmattio commented 3 years ago

cc @avsm @agarwal @kanishka-work @rdavison. This is a rather invasive change and will impact the development workflow, so I'll wait for us to reach a consensus before moving forward with this 🙂

avsm commented 3 years ago

This'll remove the need for Yaml parsing in v3.ocaml.org, won't it? I'm all for that, as the code duplication is significant (and this means we won't have to deal with the null/undefined issue with Next.js).

tmattio commented 3 years ago

> This'll remove the need for Yaml parsing in v3.ocaml.org, won't it?

Yes, and the markdown parsing as well :)

agarwal commented 3 years ago

@tmattio and I discussed this, and we feel it could potentially be useful.

@tmattio In the call, we thought that JSON serialization using an unsafe coercion would be fine since we'd be starting with well-typed values. That is incorrect. The unsafe coercion by ReScript converts option types to undefined, which is exactly the problem. We're not sure, but maybe other types also wouldn't work.
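
For context, the conventional alternative to an unsafe coercion is an explicit encoder that maps `None` to JSON `null`. The sketch below is a self-contained illustration in OCaml, assuming a hypothetical `event` record and a hand-rolled `json` type; it is not code from this PR or from v3.ocaml.org, and it is exactly the kind of serializer the workaround in the next paragraph avoids writing.

```ocaml
(* Minimal JSON value type, defined here so the sketch needs no library. *)
type json =
  | Null
  | Str of string
  | Arr of json list
  | Dict of (string * json) list

(* Hypothetical record with an optional field. *)
type event = { title : string; url : string option }

(* Encode an option explicitly: None becomes null, never "undefined". *)
let json_of_option f = function None -> Null | Some x -> f x

let json_of_event e =
  Dict [ ("title", Str e.title);
         ("url", json_of_option (fun s -> Str s) e.url) ]

(* Simplified escaping: only quotes, backslashes and newlines. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '"' -> Buffer.add_string b "\\\""
      | '\\' -> Buffer.add_string b "\\\\"
      | '\n' -> Buffer.add_string b "\\n"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

let rec to_string = function
  | Null -> "null"
  | Str s -> "\"" ^ escape s ^ "\""
  | Arr xs -> "[" ^ String.concat "," (List.map to_string xs) ^ "]"
  | Dict fields ->
      let field (k, v) = "\"" ^ escape k ^ "\":" ^ to_string v in
      "{" ^ String.concat "," (List.map field fields) ^ "}"
```

The cost of this approach is one encoder per generated type, which is the duplication the direct-reference workaround sidesteps.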

I spoke with @kanishka-work and @rdavison after, and we might have a workaround. Basically, we can reference the values from your generated code directly in components, rather than in getStaticProps. Thus, the need for JSON serializers goes away. One concern we have about this is performance. For example, all data for all languages will be pulled at page-load time even though any given page only needs one language's data. If the data was selected in getStaticProps instead, code splitting would optimize this at site-generation time. Thus, our workaround isn't perfect, but as a practical matter may not affect us for a long time in this project since total data sizes are small.

Thus, it could be worth pushing on this and creating a companion PR in v3.ocaml.org to fully test it out.

agarwal commented 3 years ago

Regarding markdown processing, I'm not sure yet what the benefits are of doing it here vs in v3.ocaml.org. I suppose it could be done here, but beware that the generated HTML might need to adhere to requirements of the design in v3. Or if the HTML is not styled, we'll have to use some method to add the styling in v3. Also, not quite sure how to hook into a highlighting solution.

ghost commented 3 years ago

The pipeline of markdown processing here (https://github.com/ocaml/v3.ocaml.org/blob/master/pages/resources/%5Btutorial%5D.res#L48) can serve as a guide for the steps that need to be implemented with omd and supporting scripts. You can see an example of the rendering here: https://v3.ocaml.org/resources/basics.

I understand that the current ocaml.org site also has logic for table of contents generation and highlighting, if that's quicker to follow.

avsm commented 3 years ago

I think it's worth a go at this too -- if module sizes become a problem then we can optimise that in a few ways, but I've crunched up far bigger bits of content in mirageos without issue in the past (https://github.com/mirage/ocaml-crunch).
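
As a rough picture of what embedding content in a module means, here is a hand-written sketch loosely modelled on the lookup-by-name style of crunched modules. The module name, file names, and contents are placeholders, and the exact interface that ocaml-crunch generates is not shown in this thread, so treat every name below as an assumption.

```ocaml
(* Hypothetical embedded-media module. File names and contents are
   placeholders, not real assets from the repository. *)
module Media = struct
  (* Each file is stored as a (name, contents) pair in the binary. *)
  let files = [
    ("logo.svg", "<svg></svg>");
    ("banner.txt", "placeholder bytes");
  ]

  (* Names of all embedded files. *)
  let file_list = List.map fst files

  (* Return the contents of a file, or None if it is not embedded. *)
  let read name = List.assoc_opt name files

  (* Total number of embedded bytes, useful for tracking module size. *)
  let total_size =
    List.fold_left
      (fun acc (_, contents) -> acc + String.length contents)
      0 files
end
```

Because the contents live in string literals, the compiled module grows with the data, which is where the size concern above comes from.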

The advantage of markdown processing in omd is that we use a single implementation instead of two. There are lots of subtle variations and extensions in implementations, but at least with omd we can extend it to our needs. We can certainly modify omd2 fairly easily to account for the specific HTML needed by v3.ocaml.org (and, in general, ensure omd is compatible with Tailwind 2).

tmattio commented 3 years ago

Thank you all for the great feedback!

I have been exploring what a transition to this would look like from the v3.ocaml.org repo. Although it still needs some work to move from the current state (especially because of the disparity between the data available in ood and the data expected in v3), I'm confident there won't be any major blockers to migrating to this PR.

It also looks like we have a consensus that moving the processing away from the frontend and into ood is sensible.

With this in mind, I'd like to suggest that we merge this PR now. We still have a lot of content to add to ood, and it'd be more efficient to do it on top of this PR than to keep syncing with every merge into master.

I'll of course still take responsibility for migrating the frontend to the new ood, but I'd prefer to stay focused on the data import for now, and focus on the frontend later, once ood is in good shape.

Does it seem like a good approach to you?

tmattio commented 3 years ago

To also answer specific concerns:

> not quite sure how to hook into a highlighting solution

From what I can see, v3 uses Tailwind's typography plugin. I've added highlighting of code blocks in #32 with highlight.js and Tailwind's typography plugin, so I figure it would work similarly.

> I understand that the current ocaml.org site also has logic for table of contents generation and highlighting

Right, the PR does not extract the TOC from the markdown documents at the moment. But that's something we've discussed, and it will be easy to add a TOC to the data that requires one.
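
As a rough idea of what TOC extraction involves, here is a minimal sketch that scans the raw markdown string for ATX headings using only the standard library. It is a stand-in, not omd's API: a real implementation would more likely walk the parsed document, and the `toc_entry` type and function names are assumptions made for this example.

```ocaml
(* One table-of-contents entry: heading depth and its text. *)
type toc_entry = { level : int; text : string }

(* Count the leading '#' characters of a line. *)
let leading_hashes line =
  let n = String.length line in
  let rec go i = if i < n && line.[i] = '#' then go (i + 1) else i in
  go 0

(* Treat "# Title" through "###### Title" as headings.
   Simplification: fenced code blocks and Setext (underlined)
   headings are not handled. *)
let entry_of_line line =
  let k = leading_hashes line in
  let n = String.length line in
  if k >= 1 && k <= 6 && k < n && line.[k] = ' ' then
    Some { level = k; text = String.trim (String.sub line k (n - k)) }
  else None

(* Collect the headings of a document in order of appearance. *)
let toc_of_markdown md =
  List.filter_map entry_of_line (String.split_on_char '\n' md)
```

The resulting list could be stored next to the raw markdown in the generated data for the documents that need a TOC.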

In the meantime, the raw markdown is provided, so any processing currently done in v3 can be left intact.

ghost commented 3 years ago

> In the meantime, the raw markdown is provided, so any processing currently done in v3 can be left intact

I will start working on removing remark and invoking omd on the provided string sometime soon. After that branch is ready in v3, we can make a final decision on the phase of processing at which we want to invoke omd.

ghost commented 3 years ago

> if module sizes become a problem then we can optimise that in a few ways

I did a quick scan a few months ago and found Thumbor and Imageflow for self-hosting. I didn't do an exhaustive search and comparison. There are also quite a few commercial services.

ghost commented 3 years ago

Here is an example of how Next.js lays out its generated data modules, in case this provides any inspiration for how to organize the generated OCaml modules:

{
  "pageProps": {
    "content": {
      "title": "History",
      "pageDescription": "A history of ocaml.org.",
      "timeline": [
        {
          "date": "Nov 2020",
          "description": "A team of programmers, designers, and content writers begins working full-time on a new implementation."
        },
        {
          "date": "2014 - 2020",
          "description": "Christophe Troestler and Anil Madhavapeddy continue to maintain the site with contributions from 250 more volunteers."
        },
        ...
      ]
    }
  }
}
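
For comparison, the same page data could be laid out as an OCaml module along these lines. This is only a sketch: the module, type, and field names are assumptions chosen to mirror the JSON keys above, and only the two timeline entries shown in the example are included.

```ocaml
(* Hypothetical OCaml counterpart of the JSON above; names mirror the
   JSON keys and are not taken from the PR. *)
module History = struct
  type timeline_entry = { date : string; description : string }

  type t = {
    title : string;
    page_description : string;
    timeline : timeline_entry list;
  }

  let content = {
    title = "History";
    page_description = "A history of ocaml.org.";
    timeline = [
      { date = "Nov 2020";
        description = "A team of programmers, designers, and content writers begins working full-time on a new implementation." };
      { date = "2014 - 2020";
        description = "Christophe Troestler and Anil Madhavapeddy continue to maintain the site with contributions from 250 more volunteers." };
      (* Remaining entries are elided in the example above. *)
    ];
  }
end
```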

Generate OCaml modules with the static data in `ood` #36