thesamovar / notpaper

GNU General Public License v3.0

LaTeX conversion to intermediary format instead of HTML #2

Open rorybyrne opened 3 years ago

rorybyrne commented 3 years ago

It might be worthwhile to convert LaTeX to some useful intermediate representation, instead of directly to HTML. That would allow this tool to behave like a compiler, with a front-end and a back-end.

There is a tool called LaTeXML which would allow conversion to XML as an intermediate representation.

rorybyrne commented 3 years ago

Another option written in Python: plastex

thesamovar commented 3 years ago

An intermediate format makes sense. That way, we could experiment with different tools.

rorybyrne commented 3 years ago

I think an abstract representation of a piece of research would be really valuable. It would need to be portable (i.e. dumpable to file that can be parsed by many languages), and friendlier than XML (imo).

JSON sounds like a good candidate to me, esp. if rendering it as a webpage is the first use-case.

rorybyrne commented 3 years ago

One idea I'm playing with is to separate the "content" of a paper from the "narrative" of a paper. In the example below, figures, tables, and paragraphs are indexed by ID under the content key, and each entry under the narratives key is a list of references to those resources. This allows for multiple narratives, which represent different "views" on the research.

{
    title: "Some Title",
    abstract: "Abstract",
    authors: [ ... ],
    narratives: {
        full: [
            { type: "heading", content: "Introduction" },
            { type: "paragraph", id: "par01" },
            { type: "paragraph", id: "par02" },
            { type: "figure", id: "fig01" },
            ...
        ],
        short: [ ... ]
    },
    content: {
        paragraphsById: {
            "par01": "blah blah blah"
        },
        figuresById: {
            "fig01": "base64_image_data"
        }
    }
}

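A minimal Python sketch of how a renderer might resolve one narrative against the content tables. It assumes the structure sketched above; the `resolve_narrative` function and the `CONTENT_TABLES` mapping are hypothetical names introduced only for illustration, and all values are placeholders.

```python
import json

# Hypothetical document following the structure above; all values are placeholders.
paper = {
    "title": "Some Title",
    "abstract": "Abstract",
    "authors": [],
    "narratives": {
        "full": [
            {"type": "heading", "content": "Introduction"},
            {"type": "paragraph", "id": "par01"},
            {"type": "figure", "id": "fig01"},
        ],
        "short": [
            {"type": "paragraph", "id": "par01"},
        ],
    },
    "content": {
        "paragraphsById": {"par01": "blah blah blah"},
        "figuresById": {"fig01": "base64_image_data"},
    },
}

# Maps a narrative item type to the content table that stores it (assumed naming).
CONTENT_TABLES = {"paragraph": "paragraphsById", "figure": "figuresById"}


def resolve_narrative(paper, name):
    """Return the items of one narrative with ID references replaced by content."""
    resolved = []
    for item in paper["narratives"][name]:
        table = CONTENT_TABLES.get(item["type"])
        if table is None:
            # Inline items such as headings carry their own content.
            resolved.append({"type": item["type"], "content": item["content"]})
        else:
            # Raises KeyError if the narrative references a missing ID.
            content = paper["content"][table][item["id"]]
            resolved.append({"type": item["type"], "content": content})
    return resolved


if __name__ == "__main__":
    # The structure is plain JSON, so it survives a dump/load round trip.
    assert json.loads(json.dumps(paper)) == paper
    for item in resolve_narrative(paper, "short"):
        print(item["type"], "->", item["content"])
```

Because both narratives point at the same `par01` entry, editing that paragraph once would update every view that references it.
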
thesamovar commented 3 years ago

An issue that might come up is that there is a LOT of detail to get right if you want to handle all possible papers. Might be better to use an existing standard like JATS (XML-based) and build on top of that?

rorybyrne commented 3 years ago

Oh nice, JATS sounds like it's exactly what we need. I'm not too familiar with existing publishing formats so thanks for pointing it out.

I think it sounds reasonable to use JATS as the target format for LaTeX/PDF conversion, and make the app take that as input. Modern (i.e. React) webapps maintain internal state in JSON format, so we'll have to convert it to some sort of JSON representation inside the app.

As long as there's a clean interface that takes JATS as input, I don't think we should necessarily use fully-compliant JATS for the app's internal state. We can try to stick to it whenever possible, but it's more important to design state that suits the functionality requirements and keep it as lean as possible. Adding features and maintaining the code will be a nightmare otherwise (React code gets especially confusing if you're not careful).
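To make the "JATS in, lean state out" idea concrete, here is a rough Python sketch using only the standard library. It assumes a tiny hand-written JATS fragment and the content/narrative layout proposed earlier in the thread; `jats_to_state` and `text_of` are hypothetical helpers, the generated `parNN` IDs are an assumption, and the figure caption stands in for image data. It is not a full JATS reader.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment using a few common JATS elements; not a complete article.
JATS_SAMPLE = """\
<article>
  <front>
    <article-meta>
      <title-group><article-title>Some Title</article-title></title-group>
      <abstract><p>Abstract</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph.</p>
      <fig id="fig01"><caption><p>A caption.</p></caption></fig>
      <p>Second paragraph.</p>
    </sec>
  </body>
</article>
"""


def text_of(element):
    """Flatten an element's text, dropping inline markup such as <italic>."""
    return "".join(element.itertext()).strip() if element is not None else ""


def jats_to_state(xml_string):
    """Build the lean internal state from top-level sections of a JATS body."""
    root = ET.fromstring(xml_string)
    state = {
        "title": text_of(root.find("front/article-meta/title-group/article-title")),
        "abstract": text_of(root.find("front/article-meta/abstract")),
        "narratives": {"full": []},
        "content": {"paragraphsById": {}, "figuresById": {}},
    }
    full = state["narratives"]["full"]
    paragraph_count = 0
    for sec in root.findall("body/sec"):
        for child in sec:
            if child.tag == "title":
                full.append({"type": "heading", "content": text_of(child)})
            elif child.tag == "p":
                # Paragraph IDs are generated here; this naming is an assumption.
                paragraph_count += 1
                par_id = f"par{paragraph_count:02d}"
                state["content"]["paragraphsById"][par_id] = text_of(child)
                full.append({"type": "paragraph", "id": par_id})
            elif child.tag == "fig":
                # The caption text stands in for image data in this sketch.
                fig_id = child.get("id")
                state["content"]["figuresById"][fig_id] = text_of(child.find("caption"))
                full.append({"type": "figure", "id": fig_id})
            # Anything else (nested sections, tables, ...) is ignored here.
    return state


if __name__ == "__main__":
    print(jats_to_state(JATS_SAMPLE)["narratives"]["full"])
```

Keeping the JATS-specific knowledge inside one function like this would let the internal state grow independently of the input format.
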

I personally think it's fine if we don't support all the details of papers off the bat. There's a saying in the startup world that you shouldn't try to "boil the ocean". We can expand the internal state of the app as we add new functionality, and at some point it will probably reach parity with JATS.

What do you think?

thesamovar commented 3 years ago

I would assume it's very straightforward to switch between an XML and JSON representation? So it hardly matters much. The main thing would be to use JATS to make sure we haven't missed something that will come back and bite us later.
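As a rough illustration of that assumption, a generic XML-to-JSON conversion can be sketched in a few lines of standard-library Python. The one wrinkle is mixed content (text interleaved with inline elements), which this sketch handles by keeping children as an ordered list; `element_to_dict` and the sample paragraph are made up for the example.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element to a JSON-friendly dict, preserving child order."""
    node = {"tag": element.tag}
    if element.attrib:
        node["attrs"] = dict(element.attrib)
    # Mixed content is kept as an ordered list of strings and child nodes,
    # so inline markup inside a paragraph is not lost or reordered.
    children = []
    if element.text and element.text.strip():
        children.append(element.text)
    for child in element:
        children.append(element_to_dict(child))
        if child.tail and child.tail.strip():
            children.append(child.tail)
    if children:
        node["children"] = children
    return node


if __name__ == "__main__":
    xml_string = '<p id="par01">Spikes are <italic>sparse</italic> events.</p>'
    print(json.dumps(element_to_dict(ET.fromstring(xml_string)), indent=2))
```

The result is verbose compared with a hand-designed state shape, which is one argument for treating JATS as a checklist rather than as the internal format.
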

And yes absolutely, there's no way we'll support all features immediately.

rorybyrne commented 3 years ago

Yeah I assume XML -> JSON is a solved problem, and looking at the Pandoc docs it seems JSON is a supported --to option. Haven't had a chance to test it yet.

> use JATS to make sure we haven't missed something that will come back and bite us later

I suppose a good rule of thumb would be "if it's not in JATS, it shouldn't be in our internal representation".

thesamovar commented 3 years ago

Well, except that we potentially want to add fine-grained metadata that won't be in JATS, but that's fine. Stuff on top can be added later. What we don't want to do is miss something fundamental that can't easily be added later on.

rorybyrne commented 3 years ago

Okay, noted. My React version isn't far off parity with your implementation, so once it's ready you can sanity-check the data model I'm using and see if there are any major problems with it.