rust-lang / mdBook

Create book from markdown files. Like Gitbook but implemented in Rust
https://rust-lang.github.io/mdBook/
Mozilla Public License 2.0

YAML for summary #176

Closed gambhiro closed 7 years ago

gambhiro commented 7 years ago

I'd like to open a discussion on using a YAML list for the summary file, and an optional YAML header in the Markdown chapters.

I realize that this would effectively kill your work on parsing the Markdown summary, but this would pay off for more complex TOC structures, while still being easy to write for the simple case.

The thing is that things start to get complicated when you have to start to try and satisfy the EPUB and MOBI specifications.

I am not suggesting TOML because it is not good at representing this kind of nested hashmap type.

In the simplest case the list items are the source path, and you can parse the title from the first heading. In this case the Cover, Title Page and Contents pages for the ebook formats can take sensible defaults.

- part-1.md
  - nameless-labyrinth.md
  - glacial-surface.md
  - countless-windows.md
    - closed-shutters.md
- part-2.md
  - semblance-of-equilibrium.md

For the ebooks the <guide> tag is mandatory in OEBPS/content.opf:

 <guide>
    <reference href='chapters/cover.xhtml' title="Cover" type='cover' />
    <reference href='chapters/titlepage.xhtml' title="Title Page" type='title-page' />
    <reference href='chapters/toc.xhtml' title="Contents" type='toc' />
</guide>

And YAML really pays off when you have to manipulate chapter properties, or to indicate for the ebook <guide> what's what. Say, when your book has a special titlepage design or the 'Cover' is called 'Capa' (pt) or 'Borító' (hu).

- { title: "Cover", path: "cover.xhtml", type: cover, linear: no }
- { title: "Title Page", path: "titlepage.xhtml", type: title-page }
- { title: "Contents", path: "toc.xhtml", type: toc }
- part-1.md
  - { title: "The Nameless Stone Labyrinth", path: "nameless-labyrinth.md" }
  - nameless-labyrinth.md
  - glacial-surface.md
  - countless-windows.md
    - closed-shutters.md
- part-2.md
  - semblance-of-equilibrium.md

There is an example TOC here.

azerupi commented 7 years ago

I am open to this idea, but there are a couple of concerns that need to be addressed first. I am also very interested in other users' opinions on this.

I don't have much experience with YAML, so some concerns / questions might seem stupid :wink:

Here are the concerns:

In the simplest case the list items are the source path, and you can parse the title from the first heading.

  1. Not specifying a title in the summary file requires one to parse the markdown files to render a TOC. It's definitely possible, but it adds complexity to the implementation. Currently, the markdown files are only parsed one at a time at the moment of rendering.
  2. What happens when both the markdown file and the summary file do not contain a title?
  3. There is a simple pattern currently to distinguish between "front matter", "main matter" and "back matter":
   [Introduction](intro.md)
   - [Chapter 1](ch1.md)
   - [Chapter 2](ch2.md)
   [Conclusion](concl.md)

Note that the chapters in "main matter" are preceded by a hyphen to allude to the numbering. Is there an equally nice way to achieve this in YAML?

  4. There is a feature in mdBook where you can provide a title and a blank link for a chapter which will result in a grayed out link in the TOC of the html book.

    - [Chapter 1](ch1.md)
    - [Chapter 2](ch2.md)
    - [Future Chapter]()

    This is a really useful feature, I know @steveklabnik uses it a lot for the new Rust book. How could this be represented in YAML?

  5. The example given:

    - { title: "Cover", path: "cover.xhtml", type: cover, linear: no }
    - { title: "Title Page", path: "titlepage.xhtml", type: title-page }
    - { title: "Contents", path: "toc.xhtml", type: toc }

    Feels out of place. This information seems to be only relevant for the EPUB version and thus should probably be put in a section of the configuration file that is specific to that renderer.
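The hyphen convention and the blank-link feature described above can be illustrated with a minimal sketch. This is not mdBook's actual summary parser; the `SummaryEntry` struct and `parse_summary_line` function are hypothetical names, and the sketch assumes one link per line with no `](` inside the title.

```rust
/// One entry of a Markdown summary, as this sketch models it (hypothetical type).
#[derive(Debug, PartialEq)]
struct SummaryEntry {
    title: String,
    /// `None` stands for a blank link such as `[Future Chapter]()`.
    path: Option<String>,
    /// True when the line starts with a hyphen (numbered "main matter").
    numbered: bool,
}

/// Parses a single line like `- [Chapter 1](ch1.md)` or `[Intro](intro.md)`.
/// Returns `None` for lines that are not a Markdown link.
fn parse_summary_line(line: &str) -> Option<SummaryEntry> {
    let trimmed = line.trim();
    // A leading hyphen marks a numbered chapter.
    let (numbered, rest) = match trimmed.strip_prefix('-') {
        Some(rest) => (true, rest.trim_start()),
        None => (false, trimmed),
    };
    let rest = rest.strip_prefix('[')?;
    let (title, rest) = rest.split_once("](")?;
    let path = rest.strip_suffix(')')?;
    Some(SummaryEntry {
        title: title.to_string(),
        // An empty link target means the chapter has not been written yet.
        path: if path.is_empty() { None } else { Some(path.to_string()) },
        numbered,
    })
}
```

Any replacement format would need to carry the same three pieces of information per entry: a title, an optional path, and whether the entry is numbered.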


I don't remember if I've written this down already, but I will summarize the way I envision how the renderers will / should work:

The MDBook struct and other internal representations should be as generic as possible. Containing enough information for the renderers to work with, without piling up renderer specific properties.

The configuration file will allow subsections for each renderer to provide extra data and settings. This is where you would put renderer specific settings and properties. Because the "core" should not know about renderer specific settings, those sections will be passed down to the renderer as is. It's the renderer's job to extract and make sense of those settings.

The renderer is passed the MDBook struct + the renderer specific settings and should be able to do its thing.
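As a rough sketch of that split, assuming the renderer-specific section is handed over as an opaque string map: the `Book`, `Renderer`, `EpubRenderer` and `run` names and the `cover-title` key are hypothetical, not mdBook's API.

```rust
use std::collections::HashMap;

/// Generic book representation, deliberately free of renderer-specific fields.
struct Book {
    title: String,
    chapters: Vec<String>,
}

/// Renderer-specific settings, passed through by the core without interpretation.
type RendererSettings = HashMap<String, String>;

trait Renderer {
    /// Name of the configuration subsection this renderer reads.
    fn name(&self) -> &str;
    fn render(&self, book: &Book, settings: &RendererSettings) -> String;
}

struct EpubRenderer;

impl Renderer for EpubRenderer {
    fn name(&self) -> &str {
        "epub"
    }

    fn render(&self, book: &Book, settings: &RendererSettings) -> String {
        // Only the renderer knows what "cover-title" means (hypothetical key).
        let cover = settings
            .get("cover-title")
            .map(String::as_str)
            .unwrap_or("Cover");
        format!("{} | {} | {} chapters", cover, book.title, book.chapters.len())
    }
}

/// The core looks up the subsection by renderer name and hands it over as is.
fn run(
    renderer: &dyn Renderer,
    book: &Book,
    config: &HashMap<String, RendererSettings>,
) -> String {
    let empty = RendererSettings::new();
    let settings = config.get(renderer.name()).unwrap_or(&empty);
    renderer.render(book, settings)
}
```

With this shape, a localized cover title such as 'Capa' lives in the `epub` subsection and never touches the generic `Book`.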

Does this make sense? :smile:

EDIT: See #149

gambhiro commented 7 years ago

Thanks for adding that description about the renderers. Yes, I was thinking in terms of how to represent everything at once, and instead the details can be broken up to different targets.

Nonetheless, I'll answer the YAML questions too.

  1. having to parse the Markdown to get a title

True, you'd have to walk through the chapter files and parse each of them while you are building the Vec<Chapter>. At least this is what I ended up doing earlier with prophecy: I converted the Markdown to HTML and used an XML parser to get the first <h1> with an XPath expression. It's quite a trip just to get a title, but the machine did the work, while I only had to list the markdown file names.

  2. missing title

Default to Untitled. The user will see and add a title somewhere.

  3. distinguishing front- main- and back-matter

Yes. I remember now I had that problem. Thinking about that now, the best way would be to have different lists:

frontmatter:
- preface.md
- intro.md

mainmatter:
- chapter1.md
- chapter2.md

backmatter:
- glossary.md
- appendix.md

  4. grayed out link

One could provide a hash key to indicate that, but I agree this is much more verbose:

- { title: "Future Chapter", status: "draft" }

  5. too much information

Yes. It does seem it would be best to figure out what convention to use for specifying more data for renderers which need it. Because at that point you do start wanting a format that can be easily serialized into Rust structs.

azerupi commented 7 years ago

True, you'd have to walk through the chapter files and parse each of them while you are building the Vec<Chapter>.

I suppose we don't really have to parse the whole markdown file, only the first (non-blank) line. Which would make it less of a problem.

Default to Untitled. The user will see and add a title somewhere.

I guess that + a warning on build could do the trick.
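A minimal sketch of that idea, assuming the title is whatever text sits on the first non-blank line once leading `#` markers are stripped. The function names are hypothetical, and closing `#` sequences or Setext headings are not handled.

```rust
/// Extracts a chapter title from the first non-blank line of a Markdown
/// source, stripping leading `#` heading markers. Returns `None` when the
/// source is empty or the first non-blank line has no text after the markers.
/// Note: a first line that is plain text (not a heading) is also accepted.
fn title_from_markdown(source: &str) -> Option<String> {
    let first = source
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())?;
    let title = first.trim_start_matches('#').trim();
    if title.is_empty() {
        None
    } else {
        Some(title.to_string())
    }
}

/// Falls back to "Untitled" and returns a warning message for the build log.
fn title_or_default(path: &str, source: &str) -> (String, Option<String>) {
    match title_from_markdown(source) {
        Some(title) => (title, None),
        None => (
            "Untitled".to_string(),
            Some(format!("warning: no title found for {}", path)),
        ),
    }
}
```

Only the first non-blank line is inspected, so the cost stays small compared with rendering the whole chapter.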

frontmatter:
- preface.md
- intro.md
mainmatter:
- chapter1.md
- chapter2.md
backmatter:
- glossary.md
- appendix.md

That's good for me.

One could provide a hash key to indicate that, but I agree this is much more verbose

I don't like that solution because when you finally add that chapter, you have to change from:

- { title: "Some chapter", draft: true }
# to
- some/path/chapter.md

Maybe it's better to assume that when the .md extension is missing it's a title?

- some/path/ch1.md
- some/path/ch2.md
# The next chapter will be rendered grayed out because it does not have a path
- Some future chapter 
# Another, more explicit way
- title: "Some future chapter"

What do you think?
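The extension rule proposed above could look roughly like this. The `Entry` enum and `classify` function are hypothetical names used only for illustration.

```rust
/// How a plain string item in the summary is interpreted (hypothetical type).
#[derive(Debug, PartialEq)]
enum Entry {
    /// A chapter backed by a Markdown file.
    File(String),
    /// A title without a path, rendered grayed out.
    Draft(String),
}

/// Treats an item ending in `.md` as a path and anything else as a title.
fn classify(item: &str) -> Entry {
    let item = item.trim();
    if item.ends_with(".md") {
        Entry::File(item.to_string())
    } else {
        Entry::Draft(item.to_string())
    }
}
```

One caveat of this rule is that a draft title that happens to end in `.md` would be read as a path, which is where the more explicit `title:` form helps.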

azerupi commented 7 years ago

I have been playing with a YAML parser to see what can and can't be done and

- part-1.md
  - title: "The Nameless Stone Labyrinth" 
    path: "nameless-labyrinth.md"
  - nameless-labyrinth.md
  - glacial-surface.md
  - countless-windows.md
    - closed-shutters.md
- part-2.md
  - semblance-of-equilibrium.md

cannot be done, apparently. Nested lists like that are not supported in YAML. This is a major drawback, because it forces a more complicated format for simply nesting chapters.

steveklabnik commented 7 years ago

YAML is also an extremely complex format in general. I've used it for years in Ruby, I hope to never use it again, personally.

azerupi commented 7 years ago

Yes, after toying with it I have to agree with Steve. I can't bend it to my will without making the summary file unnecessarily complex. I've also heard a lot of negative echoes about YAML. So I would be inclined to avoid YAML.

However, I do think that it's important that we find a way to add more metadata to chapters, if only for future-proofing.

Another way to solve this problem is to allow extra metadata to be added as a markdown paragraph:

# Summary

- [Chapter 1](ch1.md)
- [Chapter 2](ch2.md)
    - [Section 2.1](ch2/sec1.md)
    - [Section 2.2](ch2/sec2.md)
        - author: "John Doe"
        - otherproperty: "SomeValue"

    - [Section 2.3]()
- [Chapter 3](ch3.md)
-----------
[Appendix](appendix.md)

Advantages:

Drawbacks:

Or we can go the GitBook route and allow YAML front matter at the top of the markdown files to add extra metadata.

I am not sure what's best. :confused: Personally, I like that all the information is grouped in the summary file.

steveklabnik commented 7 years ago

However, I do think that it's important that we find a way to add more metadata to chapters, if only for future-proofing.

Yeah, it'd be good to have more stuff. I'm not sure what's the best there either. Maybe TOML could work?

gambhiro commented 7 years ago

Please allow me to champion the YAML case further with a parser example: https://github.com/gambhiro/yaml-summary

Don't take it too literally, it's just to indicate how this might work.

Implementing the From trait for the different cases as types, you can just .map the chapters:

for thematter in ["frontmatter", "mainmatter", "backmatter"].iter() {
    let chapters: Vec<Chapter> = doc[*thematter].clone()
        .into_iter()
        .map(|x| Chapter::from(x))
        .collect();

    print!("{}:\n\n{:?}\n\n", thematter, chapters);
}
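To make the `From` idea concrete without a YAML crate, here is a self-contained sketch. The `Item` enum is a hypothetical stand-in for a parsed YAML node, and the conversion rules simply mirror the output printed at the end of this comment (a bare string becomes a title with an empty path and `draft: true`); the linked prototype may differ in detail.

```rust
/// Hypothetical stand-in for a parsed YAML node; a real parser crate
/// would supply its own value type.
enum Item {
    /// A bare list item such as `chapter1.md`.
    Str(String),
    /// A hash item such as `{title: "Appendix A", path: appendix-a.md}`.
    Map { title: String, path: String },
}

#[derive(Debug, PartialEq)]
struct Chapter {
    title: String,
    path: String,
    draft: bool,
}

impl From<Item> for Chapter {
    fn from(item: Item) -> Self {
        match item {
            // Bare strings carry no separate path in this sketch.
            Item::Str(s) => Chapter {
                title: s,
                path: String::new(),
                draft: true,
            },
            Item::Map { title, path } => Chapter {
                title,
                path,
                draft: false,
            },
        }
    }
}

/// Converts a whole section (for example "mainmatter") in one pass.
fn convert(items: Vec<Item>) -> Vec<Chapter> {
    items.into_iter().map(Chapter::from).collect()
}
```

Each new item shape only needs another `match` arm, which is what keeps the `.map` call in the loop above unchanged.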

Yes, I'm sorry my example was not valid YAML. The correct format will follow below.

As for parsing chapter properties out of Markdown paragraphs, it is pushing a simple hack too far. Just think of correctly parsing quotes and colons in all the ways a user might type it.

The virtue of the Markdown summary is that a nested list is an easy analogy for a nested array, but it stops there. Beyond that you are inventing another serialization format.

I think the best is not either Markdown or YAML, but both. Select the parser by the file extension, SUMMARY.md or SUMMARY.yml.

The easiest for the new user is the Markdown file, provide them with the basic features that are easy in the Markdown.

For the users who need extensive customization, let them write the YAML.

If both the Markdown and YAML parsers build the same Rust struct, then the Markdown parser can simply take defaults where the YAML can customize it all.
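Selecting the parser by extension could be sketched as follows. The `SummaryFormat` enum and `summary_format` function are hypothetical, and accepting `.yaml` alongside `.yml` is an assumption of this sketch.

```rust
use std::path::Path;

/// Which parser to run for a given summary file (hypothetical type).
#[derive(Debug, PartialEq)]
enum SummaryFormat {
    Markdown,
    Yaml,
}

/// Picks the format from the file extension; `None` means unsupported.
fn summary_format(path: &Path) -> Option<SummaryFormat> {
    match path.extension()?.to_str()? {
        "md" => Some(SummaryFormat::Markdown),
        // Accepting both spellings is an assumption of this sketch.
        "yml" | "yaml" => Some(SummaryFormat::Yaml),
        _ => None,
    }
}
```

Both branches would then feed the same book struct, so everything downstream of the parser stays format-agnostic.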

I agree that YAML is not always pleasant to write, but still, you CAN when you have to. I know I will need to tweak chapter and book properties a lot to get both EPUB and MOBI to behave just so.

So the following YAML works, try with https://nodeca.github.io/js-yaml/

frontmatter:
- preface.md
- intro.md

mainmatter:
- chapter1.md
- {path: chapter2.md,  title: "Nameless Stone Labyrinth",
   sections: [
     ch2-sec1.md,
     ch2-sec2.md
   ]}
- chapter3.md

backmatter:
- glossary.md
- {title: "Appendix A", path: appendix-a.md}
- {title: "Appendix B", path: appendix-b.md}

This is used in the example linked above, where it prints:

frontmatter:

[Chapter { title: "preface.md", path: "", draft: true, sections: [] }, Chapter { title: "intro.md", path: "", draft: true, sections: [] }]

mainmatter:

[Chapter { title: "chapter1.md", path: "", draft: true, sections: [] }, Chapter { title: "Nameless Stone Labyrinth", path: "chapter2.md", draft: false, sections: [Chapter { title: "ch2-sec1.md", path: "", draft: true, sections: [] }, Chapter { title: "ch2-sec2.md", path: "", draft: true, sections: [] }] }, Chapter { title: "chapter3.md", path: "", draft: true, sections: [] }]

backmatter:

[Chapter { title: "glossary.md", path: "", draft: true, sections: [] }, Chapter { title: "Appendix A", path: "appendix-a.md", draft: false, sections: [] }, Chapter { title: "Appendix B", path: "appendix-b.md", draft: false, sections: [] }]

thebergamo commented 7 years ago

If I can contribute my 2 cents...

# Summary

- [Chapter 1](ch1.md)
- [Chapter 2](ch2.md)
    - [Section 2.1](ch2/sec1.md)
    - [Section 2.2](ch2/sec2.md)
        - author: "John Doe"
        - otherproperty: "SomeValue"

    - [Section 2.3]()
- [Chapter 3](ch3.md)
-----------
[Appendix](appendix.md)

This can easily be represented by a JSON file:

{
    "title": "Summary",
    "chapters": [
        { "title": "Chapter 1", "file": "ch1.md" },
        {
             "title": "Chapter 2",
             "file": "ch2.md",
             "sections": [
                 { "title": "Section 2.1", "file": "ch2/sec1.md" },
                 { "title": "Section 2.2", "file": "ch2/sec2.md", "author": "John Doe", "other property": "Some Value" },
                 { "title": "Section 2.3" }
             ]
        },
        { "title": "Chapter 3", "file": "ch3.md" }
    ],
    "extra": [
        { "title": "Appendix", "file": "appendix.md" }
    ]
}

JSON is a well-known file type with plenty of parsers and a spec, so checking the semantics should be enough to guarantee the quality of the file, and its flexibility will help to extend the file in the future. We can map the JSON schema to a struct and easily turn the parsed result into the corresponding HTML structure. With Markdown you are processing a kind of plain text, whereas with JSON a lot of the work may be done by a parser crate.

gambhiro commented 7 years ago

@thebergamo JSON can do it too, but the advantages you mention are true for either JSON, YAML or TOML. Between those, you have to ask:

The summary file is all about the nested array for the table of contents, where the items are either a string, a hashmap or another array.

In JSON this is quite verbose, difficult to read and write by hand, and TOML doesn't have a good story there at all. YAML has its share of idiosyncrasies in its syntax but at least you don't have to quote and brace everything, and the parser gives you clear Rust types to match in each case.
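The "string, hashmap or another array" shape maps naturally onto a recursive Rust enum. The sketch below is format-independent; `TocItem` and `count_chapters` are hypothetical names, and property values are simplified to strings.

```rust
use std::collections::HashMap;

/// One node of the table of contents (hypothetical type).
enum TocItem {
    /// A bare source path.
    Path(String),
    /// A chapter with explicit properties such as title and path.
    Props(HashMap<String, String>),
    /// A nested list of sub-chapters.
    Nested(Vec<TocItem>),
}

/// Counts chapters (bare paths and property maps) at any nesting depth.
fn count_chapters(items: &[TocItem]) -> usize {
    items
        .iter()
        .map(|item| match item {
            TocItem::Path(_) | TocItem::Props(_) => 1,
            TocItem::Nested(children) => count_chapters(children),
        })
        .sum()
}
```

Whichever serialization format is chosen, the parser only has to produce this kind of tree; the `match` is the "clear Rust types" part.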

thebergamo commented 7 years ago

@gambhiro After reading some of the other issues I get the point about a "human writable" file, but this could be done by a process that reads the directories of the "book" project and creates the summary file automatically.

Since we have some web frameworks in Rust, we should have a parser for JSON (sorry, I haven't searched to confirm this).

Not sure about it, but how difficult is it to write a DSL for this? We could create a simple file style plus validations that just check the summary itself (maybe that is over-engineering).

azerupi commented 7 years ago

@thebergamo

this can be done by a process, reading the directories of the "book" project and create the summary file automatically.

I am curious to hear how you would do that, considering that directories and files don't contain any information about order. If you can figure that out, this discussion is pointless as we wouldn't need the summary file at all. :wink:

Besides, we want (and currently have) the opposite workflow:

  1. Create a summary file
  2. Build the book
  3. All missing files and directories will be created for you
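Step 3 of that workflow can be sketched with the standard library alone. This is an illustration, not mdBook's implementation: `create_missing` is a hypothetical function, and writing empty files (rather than, say, a generated heading) is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Creates every chapter file listed in the summary that does not exist yet,
/// including parent directories. Returns the paths that were created.
fn create_missing(root: &Path, chapter_paths: &[&str]) -> io::Result<Vec<PathBuf>> {
    let mut created = Vec::new();
    for rel in chapter_paths {
        let full = root.join(rel);
        if full.exists() {
            // Never overwrite a chapter the author has already written.
            continue;
        }
        if let Some(parent) = full.parent() {
            fs::create_dir_all(parent)?;
        }
        // Assumption: new chapters start out as empty files.
        fs::write(&full, "")?;
        created.push(full);
    }
    Ok(created)
}
```

The order of chapters comes from the summary, not from the file system, which is why the summary has to exist first.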

As for Json, I would want to avoid it. It is indeed a very well known format, all languages have top-notch parsers for it. But it is not human-friendly. It is hard to write and hard to read. Just compare the two snippets you provided and imagine what a full-fledged book with hundreds of chapters would look like!

The summary file will be mostly written by humans and so we should optimize for user-friendliness.

@gambhiro

As for parsing chapter properties out of a Markdown paragraphs, it is pushing a simple hack too far. Just think of correctly parsing quotes and colons in all the ways a user might type it.

The virtue of the Markdown summary is that a nested list is an easy analogy for a nested array, but it stops there. Beyond that you are inventing another serialization format.

I find none of the discussed formats satisfactory, they all lack something. Json and Yaml (to a lesser extent) are more verbose and not as user-friendly as I would want them to be.

I don't consider "augmenting" markdown to be "pushing a hack too far". It's not a standard format, but it is still valid markdown.

I think the best is not either Markdown or YAML, but both. Select the parser by the file extension, SUMMARY.md or SUMMARY.yml.

It's a solution, but I don't like it particularly. It means you have to port your summary file from one format to the other if you ever need more complex features. This is highly unergonomic, to the point where people will probably avoid the "simple" format just in case they would ever need something only supported in the other format.

gambhiro commented 7 years ago

It means you have to port your summary file from one format to the other if you ever need more complex features.

Automatic conversion could cover that case: parse SUMMARY.md to Rust structs, then write them out to SUMMARY.yml, while skipping default values.

The user could call it as a command line command.
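The "write them out while skipping default values" step might look like the following sketch. The `Chapter` fields and the `to_yaml_item` function are hypothetical, an empty `path` and `draft: false` are taken as the defaults, and no YAML escaping is attempted.

```rust
/// Chapter as produced by the Markdown parser (hypothetical fields).
struct Chapter {
    title: String,
    /// Empty when the chapter has no file yet.
    path: String,
    draft: bool,
}

/// Serializes one chapter as a YAML flow mapping, omitting default values.
/// Caveat: titles containing double quotes are not escaped in this sketch.
fn to_yaml_item(ch: &Chapter) -> String {
    let mut fields = vec![format!("title: \"{}\"", ch.title)];
    if !ch.path.is_empty() {
        fields.push(format!("path: \"{}\"", ch.path));
    }
    if ch.draft {
        fields.push("draft: true".to_string());
    }
    format!("- {{ {} }}", fields.join(", "))
}
```

A real conversion command would also have to handle nesting and quoting, so this only shows the default-skipping part.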

people will probably avoid the "simple" format

I don't think so, Markdown is attractively simple, as long as all you need is a chapter sequence. Also, this is partly about how you "market it" in the documentation. If that is what the new user meets first in a "getting started" guide, it looks easy to get results fast.

I think people just want to get started fast, instead of thinking about every stage of their doc or book at the beginning, especially if a conversion command is available.

thebergamo commented 7 years ago

@azerupi

I am curious to hear how you would do that, considering that directories and files don't contain any information about order. If you can figure that out, this discussion is pointless as we wouldn't need the summary file at all. 😉

Once we have a known structure like:

|_chapter_1.md
|__section_1_1.md
|_chapter_2.md
|__section_2_1.md
|__section_2_2.md
...

You can walk the folder and check the values; most of the information in the summary is links to these files anyway. With a structure like this we can parse it automatically.

Or, as I said, we can just create a DSL for this.

Does that make sense?

azerupi commented 7 years ago

Once we have a known structure like

Yes, but that is a no-go. Removing or adding a chapter in the middle would require renaming all the other chapters.

we can just create a DSL for this.

What do you mean by that?

thebergamo commented 7 years ago

Yes, but that is a no-go. Removing or adding a chapter in the middle would require renaming all the other chapters.

Makes sense. I was thinking a little about a CLI tool to manage it, but maybe that would be overkill.

Since we have problems with the known formats and already have a parser for Markdown, having our own format for this task would be OK: a way to specify the summary with some keywords that help us parse the file.

I'm not sure what it would look like; for now it is just an idea of creating simple keywords to help us parse that.

What do you think?

azerupi commented 7 years ago

I'm sorry, I didn't fully understand what you propose. Could you explain it a little more? :)

thebergamo commented 7 years ago

Sorry for the bad English :( (I'm improving it! I swear!)

OK, let me give an example; maybe a visual example explains it better.

Summary file:

summary
    chapters
        Chapter 1: 
            file: ch1.md
            author: Mark Twain
        Chapter 2: 
            file: ch2.md
            sections:
                Section 2.1:
                    file: sec2_1.md
                    author: Mark Twain
    chapters_end
summary_end

It isn't the best human readable/writable file format, but it gives an idea of how the summary file could look.

gambhiro commented 7 years ago

@thebergamo the necessary properties in the book model structs are difficult to predict at this point. You'd need to have a settled representation of the internal book models to design a DSL that will cover everything. At this stage, you would have to keep changing the DSL as the models change. Also, where is the point where you aren't just inventing another TOML or YAML?

In addition, if someone uses mdbook as a lib, and they want to keep some extra data in the summary file, they would also have to extend the DSL just to access a hashkey or two. It seems overkill to me.

thebergamo commented 7 years ago

@gambhiro I get your point, and it makes sense to me.

gaites commented 7 years ago

I was wondering what the shortcomings of TOML are for implementing the summary? The Value::Array from toml-rs seems to handle the use case pretty well. I'm really new to programming, so I'm sure that I'm missing something. One great thing about this format is the option of iterating through keys and easily storing arbitrary config options in a contained Vec which could be used by consumers of the lib.

The table name could act as the chapter name by default, with the option to override it. Part of it would be selecting sensible defaults and creating good documentation for the feature. It would also allow using TOML for the config file instead of having too many different file formats throughout the project.

[TOC]
key = "value"
key2 = "value2"

[TOC.frontmatter]
key = "value"

[TOC.frontmatter.foreward]
name = "Really Great Foreward by Edgar Allen Poe"

azerupi commented 7 years ago

@Saiyt Well the first issue I see is that you need to repeat the "namespace" for every chapter over and over again. This means you would have to do a lot of search&replace when you move things around.

The characteristics I am looking for are the following:

Other things to consider:

Considering all of this, my personal favorite at this moment is still the "augmented markdown" version.

gambhiro commented 7 years ago

Considering all of this, my personal favorite at this moment is still the "augmented markdown" version.

How interesting! I guess we are not all the same :)

Never mind, I can add a YAML parser if I need that, as long as the book representation is clear.

azerupi commented 7 years ago

What I like the most right now, an idea from @gambhiro, is to keep the same format for the summary but allow an optional TOML header in chapters with chapter specific information.
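How an optional chapter header might be separated from the Markdown body is sketched below. The thread does not specify a delimiter, so the `+++` fence lines are an assumption (borrowed from a common TOML front-matter convention), and `split_header` is a hypothetical name. Parsing the header text itself would be left to a TOML crate.

```rust
/// Splits a chapter into an optional header and the Markdown body.
/// Assumes (hypothetically) that the header sits at the very top of the
/// file between two `+++` lines, with `\n` line endings.
fn split_header(source: &str) -> (Option<&str>, &str) {
    let rest = match source.strip_prefix("+++\n") {
        Some(rest) => rest,
        None => return (None, source),
    };
    match rest.split_once("\n+++\n") {
        Some((header, body)) => (Some(header), body),
        // Unterminated header: treat the whole file as plain Markdown.
        None => (None, source),
    }
}
```

Chapters without a header pass through unchanged, so existing books would keep working with the same summary format.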

gambhiro commented 7 years ago

I don't have much to add, it was a good discussion but even for the multilang feature I found it sufficient to use book.toml for storing any renderer-specific metadata, and the other data came from SUMMARY.md plus toml chapter headers.

In other words I recommend closing this if the above is a satisfactory conclusion.