Make `Format` an enumerated type

anton-k commented 12 years ago

Format is a synonym for String. Users have to look at the source code to find the right values for this type. (It can be "html" or "Html" or "latex" or "LaTeX" or "tex".) It's not clear which from the docs alone. Maybe it's better to define a new data type:

data Format = FormatHtml | FormatTex | ...
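
A minimal sketch of how such a type might be used, assuming only the two constructors named above. The helper `parseFormat` is hypothetical, and the accepted spellings are just the ones listed in this comment, not an authoritative list:

```haskell
import Data.Char (toLower)

-- Hypothetical enumerated type; only the two constructors named above.
data Format = FormatHtml | FormatTex
  deriving (Eq, Show, Enum, Bounded)

-- Hypothetical helper: map the spellings mentioned above to a constructor.
-- Anything else is rejected instead of being silently accepted.
parseFormat :: String -> Maybe Format
parseFormat s = case map toLower s of
  "html"  -> Just FormatHtml
  "latex" -> Just FormatTex
  "tex"   -> Just FormatTex
  _       -> Nothing
```

With a type like this, the valid values show up in the generated docs and a misspelled format is caught at the parsing boundary.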

mpickering commented 9 years ago

+1 to this

ghost commented 9 years ago

I couldn't find any actual type synonym definition for Format. I used ag -s type | ag -s Format and the results were useless.

Is the type synonym being referred to a philosophical one? Or has it been fixed already?

timtylin commented 9 years ago

Format is defined in pandoc-types, specifically in Text.Pandoc.Definition.

Note that it already has an "it just works" instance of Eq that ignores case:

newtype Format = Format String
               deriving (Read, Show, Typeable, Data, Generic)

instance IsString Format where
  fromString f = Format $ map toLower f

instance Eq Format where
  Format x == Format y = map toLower x == map toLower y
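
To illustrate what those instances do, here is a small self-contained sketch. It repeats the definitions with a reduced deriving list (an assumption, to avoid extra extensions); `sameFormat` and `normalized` are names made up for the example:

```haskell
import Data.Char (toLower)
import Data.String (IsString (..))

newtype Format = Format String
  deriving (Show)

instance IsString Format where
  fromString f = Format $ map toLower f

instance Eq Format where
  Format x == Format y = map toLower x == map toLower y

-- Comparison ignores case even when the constructor is applied directly.
sameFormat :: Bool
sameFormat = Format "HTML" == Format "html"

-- fromString lower-cases its argument before wrapping it.
normalized :: Format
normalized = fromString "LaTeX"
```

So "html" and "Html" compare equal, but nothing stops a caller from constructing a Format that no writer knows about.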

I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - char. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).

I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.

ghost commented 9 years ago

Oh, so it's a different package. Thanks for the info. I'll drop it for the time being.

jgm commented 9 years ago

+++ Tim T.Y. Lin [Mar 04 15 00:45 ]:

> I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - char. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).

No, the extensions are not part of the string on Format.

> I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.

Right. In principle, a sum type would be better. However, it's a big change and would break lots of existing filters, so it's not clear it's worth it.

timtylin commented 9 years ago

> No, the extensions are not part of the string on Format.

Right, turns out getReader/getWriter takes a String and not a Format. Well that just makes it even more inconsequential then. Does anything directly use Format when interfacing with Pandoc, other than filters written in Haskell?

tarleb commented 5 years ago

One question is whether it should be possible to pass a custom Format, or whether Format can only contain known formats. I.e., should we use

data Format = Markdown | Docx | ReStructuredText | …

or rather

data KnownFormat = Markdown | Docx | ReStructuredText | …

data Format = Format KnownFormat
            | CustomFormat String

I can see arguments for both variants; most arguments in favor of a finite sum type are given above. On the other hand, we'd limit users in their ability to pass format information to filters, custom writers, and programs built on top of pandoc's library.
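
To make the difference concrete, here is a rough sketch of how parsing might look under each variant. The three constructors are the ones shown above; the name table and the helpers `knownFormatName`, `parseFormat` and `parseKnownFormat` are assumptions made up for illustration:

```haskell
import Data.Char (toLower)

data KnownFormat = Markdown | Docx | ReStructuredText
  deriving (Eq, Show, Enum, Bounded)

data Format = Format KnownFormat
            | CustomFormat String
  deriving (Eq, Show)

-- Hypothetical names for the known formats; the spellings are assumptions.
knownFormatName :: KnownFormat -> String
knownFormatName Markdown         = "markdown"
knownFormatName Docx             = "docx"
knownFormatName ReStructuredText = "rst"

-- Extensible variant: unknown names are kept as CustomFormat.
parseFormat :: String -> Format
parseFormat s =
  case lookup (map toLower s) table of
    Just k  -> Format k
    Nothing -> CustomFormat s
  where
    table = [(knownFormatName k, k) | k <- [minBound .. maxBound]]

-- Finite variant: unknown names are rejected and the caller must decide.
parseKnownFormat :: String -> Maybe KnownFormat
parseKnownFormat s = case parseFormat s of
  Format k       -> Just k
  CustomFormat _ -> Nothing
```

In the extensible variant the unknown name survives and can be passed on to filters; in the finite variant the Nothing case has to be handled somewhere in the reader.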

Personally, I lean towards a finite sum type, as I feel the advantages outweigh the slight loss in flexibility. The only real problem I see is how to handle unknown format specifications during parsing: Should those be turned into a default format, or maybe a code block?

jgm commented 5 years ago

I'm not sure about the finite vs extensible question, but like you I lean towards finite. The obvious approach would be to just omit raw content with an unknown format, with a log warning.
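
A rough sketch of that approach, using simplified stand-in types rather than pandoc's real AST or logging machinery. `RawInput`, `parseFormat` and `readRawBlocks` are hypothetical names, and the warnings are plain strings:

```haskell
import Data.Either (partitionEithers)

data Format = HTML | LaTeX
  deriving (Eq, Show)

-- Stand-in for a raw block as it appears in the input:
-- a format name and the raw content.
data RawInput = RawInput String String

data Block = RawBlock Format String
  deriving (Eq, Show)

parseFormat :: String -> Maybe Format
parseFormat "html"  = Just HTML
parseFormat "latex" = Just LaTeX
parseFormat _       = Nothing

-- Keep raw blocks whose format is known; drop the others and
-- collect one warning message for each dropped block.
readRawBlocks :: [RawInput] -> ([String], [Block])
readRawBlocks = partitionEithers . map convert
  where
    convert (RawInput name content) =
      case parseFormat name of
        Just fmt -> Right (RawBlock fmt content)
        Nothing  -> Left ("Dropping raw block with unknown format: " ++ name)
```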

If we're thinking about this question, I think we might want to address a bigger issue about raw blocks. This has come up with ipynb. Jupyter notebook code cells will often generate output in multiple formats: for example, a table might be produced in text/html and text/plain. The plain version is a fallback, so if you're converting to HTML, the HTML version will be used; if to LaTeX, the fallback would be to include the plain text version in a verbatim environment.

It's tough to handle this properly in pandoc. Given that the behavior of the reader is supposed to be independent of the writer, we can either (a) include both the HTML version as a raw block and the plain text version as a code block, with the result that you'll see TWO versions of the table when it's converted to HTML or (b) just include the HTML version, with the result that there will be no fallback when it's converted to LaTeX or other formats. A bad choice, which makes it impossible to fully emulate nbconvert.

One thing that would help here would be an AST element that includes content conditionally on the format. Something like this:

[ IfFormat HTML [RawBlock "<table>..."]
, IfFormat LaTeX [CodeBlock "..."]
]

With this kind of structure one could remove the Format specifier from the RawBlock itself.
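
Roughly, a writer-side pass could resolve such elements as in the sketch below. The Block type is a simplified stand-in for pandoc's, `IfFormat` is the hypothetical constructor from above, and `resolve` is a made-up name:

```haskell
data Format = HTML | LaTeX
  deriving (Eq, Show)

-- Simplified stand-in for pandoc's Block type. RawBlock carries no
-- Format here, since the enclosing IfFormat provides it.
data Block
  = RawBlock String
  | CodeBlock String
  | IfFormat Format [Block]
  deriving (Eq, Show)

-- Keep the contents of IfFormat elements that match the output
-- format and drop the others; all other blocks pass through.
resolve :: Format -> [Block] -> [Block]
resolve out = concatMap go
  where
    go (IfFormat f bs)
      | f == out  = resolve out bs
      | otherwise = []
    go b = [b]

example :: [Block]
example =
  [ IfFormat HTML [RawBlock "<table>...</table>"]
  , IfFormat LaTeX [CodeBlock "..."]
  ]
```

Here resolve HTML example keeps only the raw table and resolve LaTeX example keeps only the code block, so the reader can emit both without knowing the output format.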

But thinking about the fallback part of this, one sees a need for format specifications that encompass multiple formats, like HTML OR Markdown or NOT(HTML OR Markdown). (Format could perhaps be a Boolean algebra, https://hackage.haskell.org/package/cond-0.4.1/candidate/docs/Data-Algebra-Boolean.html)
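
As a hand-rolled sketch of that idea (not using the linked package), a specification could be a small expression type that is evaluated against the output format. `FormatSpec`, `matches`, `htmlOrMarkdown` and `fallback` are all hypothetical names:

```haskell
data Format = HTML | Markdown | LaTeX
  deriving (Eq, Show)

-- Hypothetical format specification built from Boolean connectives.
data FormatSpec
  = Is Format
  | Or FormatSpec FormatSpec
  | And FormatSpec FormatSpec
  | Not FormatSpec
  deriving (Show)

-- Does the given output format satisfy the specification?
matches :: FormatSpec -> Format -> Bool
matches (Is f)    out = f == out
matches (Or a b)  out = matches a out || matches b out
matches (And a b) out = matches a out && matches b out
matches (Not a)   out = not (matches a out)

-- "HTML OR Markdown"
htmlOrMarkdown :: FormatSpec
htmlOrMarkdown = Is HTML `Or` Is Markdown

-- "NOT(HTML OR Markdown)", usable for the plain-text fallback.
fallback :: FormatSpec
fallback = Not htmlOrMarkdown
```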

mb21 commented 5 years ago

> Jupyter notebook code cells will often generate output in multiple formats

Could you give a couple more examples? Is the fallback always plain-text? Or are the fallbacks at least ordered? Like try html but if you cannot do that try some format and if all else fails try plain text?

Just a thought: instead of going with a whole boolean algebra, the ipynb reader could also put in a Raw "ipynb" ... and then we would put in a pandoc filter (which would know what the input and output format is) that does the right thing. But yeah, maybe that's not actually better.

jgm commented 5 years ago

What I ended up doing is putting a little filter filterIpynbOutput in T.P.App; if --ipynb-output=best is selected, this tries to determine the best raw block to use, given the output format, and strips the others. So, a bit like your idea.
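
The general idea can be sketched like this. This is not the actual filterIpynbOutput code: the types are simplified, and the `preferences` table and `bestOutput` are assumptions made up for illustration:

```haskell
import Data.List (find)
import Data.Maybe (listToMaybe, mapMaybe)

-- One output of a notebook cell: a format name and its content.
type Alternative = (String, String)

-- Hypothetical preference order for each output format.
-- The concrete lists are assumptions, not pandoc's real ones.
preferences :: String -> [String]
preferences "html"  = ["html", "markdown", "plain"]
preferences "latex" = ["latex", "plain"]
preferences _       = ["plain"]

-- Pick the first available alternative in preference order;
-- every other alternative is dropped by the caller.
bestOutput :: String -> [Alternative] -> Maybe Alternative
bestOutput out alts =
  listToMaybe (mapMaybe (\f -> find ((== f) . fst) alts) (preferences out))
```

Because the choice happens in a separate pass that knows the output format, the reader itself stays independent of the writer.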

despresc commented 4 years ago

A few thoughts on a new Format type.

Having a Formats algebra to specify ranges of formats like in the stalled pull request is a good idea, as is having something like IfFormatBlock and IfFormatInline constructs (from this comment). I don't think the If* constructors remove the need for the Format in the Raw* constructors, though, based on current usage. In Writers.Markdown, as an example, the format of a RawBlock influences how it's rendered, not just whether or not it's rendered.

One outline of a design is to include something like this in pandoc-types:

module Text.Pandoc.Format where

-- Absolutely anything that might occur in Format right now is included. Requires a look through
-- the pandoc code base to get everything, I think.
data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3 | ...
  deriving (..., Enum, Bounded)

-- The Formats boolean algebra is just the normal one for Set Format.
newtype Formats = Formats (Set Format)

-- As a format specifier or selector, Formats x means "any of the formats in x".
matchesFormat :: Formats -> Format -> Bool
(Formats s) `matchesFormat` f = f `Set.member` s

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

anyFormat :: Formats
anyFormat = anyOf [minBound..maxBound]

notFormat :: Formats -> Formats
notFormat (Formats s) = Formats $ t `Set.difference` s
  where Formats t = anyFormat

-- and various other boolean operations on Formats
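
For instance, the remaining operations might look like the sketch below. It uses a cut-down Format with four constructors so that it is self-contained; `noFormat`, `orFormats` and `andFormats` are hypothetical names. Note that Format needs an Ord instance for Set to work:

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Cut-down Format for illustration; Ord is required by Data.Set.
data Format = HTML | HTML5 | EPUB | LaTeX
  deriving (Eq, Ord, Show, Enum, Bounded)

newtype Formats = Formats (Set Format)
  deriving (Eq, Show)

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

-- Top and bottom of the algebra.
anyFormat, noFormat :: Formats
anyFormat = anyOf [minBound .. maxBound]
noFormat  = Formats Set.empty

-- Disjunction and conjunction are set union and intersection.
orFormats, andFormats :: Formats -> Formats -> Formats
orFormats  (Formats a) (Formats b) = Formats (a `Set.union` b)
andFormats (Formats a) (Formats b) = Formats (a `Set.intersection` b)

-- Negation is the complement relative to anyFormat.
notFormat :: Formats -> Formats
notFormat (Formats s) = Formats (t `Set.difference` s)
  where Formats t = anyFormat
```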

The Format type supports a sub-format relation, where x is a sub-format of y if a raw element of format x can always be included in an output format y. This (with helper functions) should make it easier to figure out when IfFormat* and Raw* elements should be rendered. The two functions below should represent that relation, the actual definitions requiring a look through pandoc to make sure they're accurate.

-- List the sub-formats of the given format
includesFormats :: Format -> Formats
includesFormats HTML = anyOf [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includesFormats HTML5 = anyOf [HTML5, EPUB3]
includesFormats EPUB = anyOf [EPUB, EPUB2, EPUB3]
-- etc.

-- List the super-formats of the given format
includedByFormats :: Format -> Formats
includedByFormats HTML = anyOf [HTML]
includedByFormats HTML5 = anyOf [HTML, HTML5]
includedByFormats EPUB = anyOf [HTML, EPUB]
-- etc.
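
Since the two functions describe the same relation from opposite sides, one option is to write only includesFormats by hand and derive the other, so they cannot drift apart. A sketch with a six-constructor Format; the catch-all case (every other format includes only itself) is an assumption for illustration, not a claim about pandoc:

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3
  deriving (Eq, Ord, Show, Enum, Bounded)

newtype Formats = Formats (Set Format)
  deriving (Eq, Show)

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

matchesFormat :: Formats -> Format -> Bool
(Formats s) `matchesFormat` f = f `Set.member` s

-- Hand-written half of the relation, following the cases above.
includesFormats :: Format -> Formats
includesFormats HTML  = anyOf [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includesFormats HTML5 = anyOf [HTML5, EPUB3]
includesFormats EPUB  = anyOf [EPUB, EPUB2, EPUB3]
includesFormats f     = anyOf [f]  -- assumption: only itself

-- Derived half: y is a super-format of x exactly when
-- x is among the sub-formats of y.
includedByFormats :: Format -> Formats
includedByFormats x =
  anyOf [y | y <- [minBound .. maxBound], includesFormats y `matchesFormat` x]
```

With these definitions the derived function reproduces the three hand-written cases above.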

It would be simpler to have only concrete, fully-specified formats in Format (and maybe consolidate formats that are indistinguishable from each other), but that would complicate things for Writers.Markdown, which needs to be able to render a Format when writing a RawBlock or RawInline. That also means that Format can't easily be replaced by Formats in those constructors.

Having a "big" Format type should at least allow it to be used in places where Text is used currently, like reader specification in Reader.readers, or default extension selection in Extensions.

despresc commented 4 years ago

Currently, Format is used only by the writers to figure out how to render a RawBlock and RawInline. I have noticed a couple of things in pandoc that have implications for the sub-format relation:

- all of the markdown* formats are equivalent to each other in the sub-format sense, in that any raw element in one markdown* format can always be included in the output for any other. The only way they differ seems to be in choosing default extensions (and that happens via a Text string, not a Format).
- many output formats like commonmark*, epub*, slideous, and so on, are not related to any other format in the sub-format sense, even themselves: they are never included in any output at all.

If Format is to be used in more places, it might be helpful also to have a

toConcreteFormat :: Format -> Format
toConcreteFormat HTML = HTML5
toConcreteFormat HTML5 = HTML5
toConcreteFormat EPUB = EPUB3
-- etc.

that takes under-specified formats and chooses a default concrete one for them, like the --to option currently does.

despresc commented 4 years ago

Maybe a better way to define a sub-format is to say that x is a sub-format of y if whenever a raw element of format y can be included somewhere, a raw element of format x can be included in the same place and in the same way.

mb21 commented 4 years ago

I have the feeling there are a few different "sub-format" relations..

- you have the RawInline and RawBlock AST elements, which enable you to include raw snippets of format X when doing -t X, but also include tex in markdown for example (while other writers would drop raw tex)
- same writer, different extensions enabled:
  - e.g. -t markdown_phpextra
  - similarly, when doing pandoc -t html, it's a synonym for -t html5
- different writer, but it uses another writer
  - when doing -t epub, there is an epub writer, which however calls the html writer
  - -t pdf uses either latex or html writer
  - -t odt basically zips what -t opendocument would produce AFAIK

despresc commented 4 years ago

Yes, I think there are a few relevant relations. There are:

- the Raw* one, so that writers can test how a Raw* element should be included, if at all
- the IfFormatBlock and IfFormatInline one, once they exist, so that conditional rendering happens properly
- the --to one, where some formats are aliases for other formats

I think jgm/pandoc-types#78 deals with the first two. The -t one can be solved by making sure Writers.writers is kept up-to-date, and maybe writing a toConcreteFormat :: Format -> Format function.

I think the writers using other writers as intermediates sorts itself out naturally from the perspective of the first two relations, based on the current pandoc behaviour. Right now it's stated in the manual that raw blocks need to use an html* format to be included in epub* output, and that the format to be included in -t pdf is whatever the engine is, so I think there's an expectation that the format used to render the document initially won't be the same as the final output format.

The formats representing different extensions problem should also hopefully be solved in that pull request, for instance by considering all the markdown* formats to be sub-formats of each other.

jgm / pandoc

Make `Format` an enumerated type #547