Open anton-k opened 12 years ago
+1 to this
I couldn't find any actual type synonym definition for Format. I used ag -s type | ag -s Format and the results were useless.
Is the type synonym being referred to a philosophical one? Or has it already been fixed?
Format is defined in pandoc-types, specifically in Text.Pandoc.Definition.
Note that it already has an "it just works" instance of Eq that ignores case:
newtype Format = Format String
  deriving (Read, Show, Typeable, Data, Generic)

instance IsString Format where
  fromString f = Format $ map toLower f

instance Eq Format where
  Format x == Format y = map toLower x == map toLower y
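To see the case-insensitive comparison in isolation, here is a minimal self-contained sketch. It copies the two instances quoted above but derives only Show, and the helper isHtml is a hypothetical name added for illustration; it is not part of pandoc-types.

```haskell
import Data.Char (toLower)
import Data.String (IsString (..))

-- Cut-down copy of the newtype quoted above (only Show is derived here).
newtype Format = Format String
  deriving (Show)

-- String literals are lower-cased on the way in.
instance IsString Format where
  fromString f = Format $ map toLower f

-- Equality ignores case, so values built directly with the constructor
-- still compare as expected.
instance Eq Format where
  Format x == Format y = map toLower x == map toLower y

-- Hypothetical helper: does this format denote HTML, whatever its casing?
-- With OverloadedStrings enabled one could write the literal "html" directly.
isHtml :: Format -> Bool
isHtml f = f == fromString "html"
```

With these definitions, isHtml (Format "HTML") and isHtml (Format "Html") both hold, while isHtml (Format "latex") does not.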
I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - chars. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).
I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.
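For readers unfamiliar with the +/- notation, the following sketch shows how a spec such as markdown+smart-raw_html could be split into a base name and a list of extension toggles. This is not the code of Text.Pandoc.parseFormatSpec; the names Toggle and splitFormatSpec are assumptions made up for this illustration.

```haskell
-- | Whether an extension is switched on (+) or off (-).
data Toggle = Enable String | Disable String
  deriving (Eq, Show)

-- | Split a spec such as "markdown+smart-raw_html" into the base format
-- name and the extension toggles that follow it.
splitFormatSpec :: String -> (String, [Toggle])
splitFormatSpec spec = (base, toggles rest)
  where
    (base, rest) = break isSign spec
    isSign c = c == '+' || c == '-'
    -- Each remaining chunk starts with its sign character.
    toggles [] = []
    toggles (sign : xs) =
      let (name, more) = break isSign xs
          toggle = if sign == '+' then Enable name else Disable name
       in toggle : toggles more
```

Here splitFormatSpec "markdown+smart-raw_html" yields ("markdown", [Enable "smart", Disable "raw_html"]), which is the product-type shape (format plus extension set) mentioned above.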
Oh, so it's a different package. Thanks for the info. I'll drop it for the time being.
+++ Tim T.Y. Lin [Mar 04 15 00:45 ]:
> I seem to recall that the reason Format is a String is mainly due to the way extensions are specified, where they are just concatenated onto Format using the + and - chars. This is not just the way the CLI works, but also how the module itself exposes getReader and getWriter (i.e., the formats get passed through Text.Pandoc.parseFormatSpec).
No, the extensions are not part of the string on Format.
> I mean it's arguably a hack, but changing it to an actual sum type (in fact a product type when you consider the set of extensions) will definitely break backward compatibility, so this is most likely going to be a 2.0 thing.
Right. In principle, a sum type would be better. However, it's a big change and would break lots of existing filters, so it's not clear it's worth it.
> No, the extensions are not part of the string on Format.
Right, turns out getReader/getWriter take a String and not a Format. Well, that just makes it even more inconsequential then. Does anything directly use Format when interfacing with Pandoc, other than filters written in Haskell?
One question is whether it should be possible to pass a custom Format, or whether Format can only contain known formats. I.e., should we use
data Format = Markdown | Docx | ReStructuredText | …
or rather
data KnownFormat = Markdown | Docx | ReStructuredText | …
data Format = Format KnownFormat
            | CustomFormat String
I can see arguments for both variants; most arguments in favor of a finite sum type are given above. On the other hand, we'd limit users in their ability to pass format information to filters, custom writers, and programs built on top of pandoc's library.
Personally, I lean towards a finite sum type, as I feel the advantages outweigh the slight loss in flexibility. The only real problem I see is how to handle unknown format specifications during parsing: should those be turned into a default format, or maybe a code block?
I'm not sure about the finite vs extensible question, but like you I lean towards finite. The obvious approach would be to just omit raw content with an unknown format, with a log warning.
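As a rough sketch of the "omit with a warning" approach, assuming a finite KnownFormat type: the constructor list, the accepted spellings, and the names RawSnippet, parseKnownFormat, and resolveSnippets are all illustrative assumptions, not existing pandoc API.

```haskell
import Data.Char (toLower)
import Data.Either (partitionEithers)

-- Illustrative subset of formats; a real list would be much longer.
data KnownFormat = Markdown | Docx | ReStructuredText | HTML | LaTeX
  deriving (Eq, Show)

-- | A raw snippet as a reader might first see it: format name plus content.
data RawSnippet = RawSnippet String String
  deriving (Eq, Show)

-- | Case-insensitive lookup of a format name (spellings are assumptions).
parseKnownFormat :: String -> Maybe KnownFormat
parseKnownFormat s = case map toLower s of
  "markdown" -> Just Markdown
  "docx" -> Just Docx
  "rst" -> Just ReStructuredText
  "html" -> Just HTML
  "latex" -> Just LaTeX
  "tex" -> Just LaTeX
  _ -> Nothing

-- | Keep snippets with a recognised format; collect a warning for the rest.
-- Returns (warnings, kept snippets).
resolveSnippets :: [RawSnippet] -> ([String], [(KnownFormat, String)])
resolveSnippets = partitionEithers . map resolve
  where
    resolve (RawSnippet name body) =
      case parseKnownFormat name of
        Just fmt -> Right (fmt, body)
        Nothing -> Left ("Dropping raw content with unknown format: " ++ name)
```

A caller would pass the warnings to whatever logging mechanism is in use and continue with the kept snippets only.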
If we're thinking about this question, I think we might want to address a bigger issue about raw blocks. This has come up with ipynb. Jupyter notebook code cells will often generate output in multiple formats: for example, a table might be produced in text/html and text/plain. The plain version is a fallback, so if you're converting to HTML, the HTML version will be used; if to LaTeX, the fallback would be to include the plain text version in a verbatim environment.
It's tough to handle this properly in pandoc. Given that the behavior of the reader is supposed to be independent of the writer, we can either (a) include both the HTML version as a raw block and the plain text version as a code block, with the result that you'll see TWO versions of the table when it's converted to HTML, or (b) just include the HTML version, with the result that there will be no fallback when it's converted to LaTeX or other formats. Neither choice is good, and this makes it impossible to fully emulate nbconvert.
One thing that would help here would be an AST element that includes content conditionally on the format. Something like this:
[ IfFormat HTML [RawBlock "<table>..."]
, IfFormat LaTeX [CodeBlock "..."]
]
With this kind of structure one could remove the Format specifier from the RawBlock itself.
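A minimal sketch of how such a conditional element could be resolved once the output format is known. The Block type here is a toy stand-in for pandoc's AST, RawBlock carries no format (as suggested above), and resolveFor is a hypothetical name.

```haskell
data Format = HTML | LaTeX | Markdown
  deriving (Eq, Show)

-- | A cut-down block type; only the constructors needed for the sketch.
data Block
  = Para String
  | CodeBlock String
  | RawBlock String
  | IfFormat Format [Block]
  deriving (Eq, Show)

-- | Resolve conditional content for a given output format: a matching
-- 'IfFormat' node is replaced by its (recursively resolved) children,
-- a non-matching one is dropped, and all other blocks pass through.
resolveFor :: Format -> [Block] -> [Block]
resolveFor target = concatMap go
  where
    go (IfFormat fmt blocks)
      | fmt == target = resolveFor target blocks
      | otherwise = []
    go block = [block]
```

Applied to the two-element list above, resolveFor HTML keeps only the raw table and resolveFor LaTeX keeps only the code block. For Markdown nothing survives, which is the fallback gap discussed next.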
But thinking about the fallback part of this, one sees a need for format specifications that encompass multiple formats, like HTML OR Markdown or NOT(HTML OR Markdown). (Format could perhaps be a Boolean algebra, https://hackage.haskell.org/package/cond-0.4.1/candidate/docs/Data-Algebra-Boolean.html)
> Jupyter notebook code cells will often generate output in multiple formats
Could you give a couple more examples? Is the fallback always plain-text? Or are the fallbacks at least ordered? Like try html, but if you cannot do that try some format, and if all else fails try plain text?
Just a thought: instead of going with a whole boolean algebra, the ipynb reader could also put in a Raw "ipynb" ... and then we would put in a pandoc filter (which would know what the input and output format is) that does the right thing. But yeah, maybe that's not actually better.
What I ended up doing is putting a little filter filterIpynbOutput in T.P.App; if --ipynb-output=best is selected, this tries to determine the best raw block to use, given the output format, and strips the others. So, a bit like your idea.
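The "pick the best alternative for the output format" idea can be sketched as an ordered preference lookup. This is not the actual filterIpynbOutput code; the preference table and the names Alternatives, preferences, and bestAlternative are assumptions made for illustration.

```haskell
import Data.Maybe (listToMaybe, mapMaybe)

-- | Alternative renderings of one output cell: (MIME type, content).
type Alternatives = [(String, String)]

-- | Hypothetical preference order per output format, best first.
preferences :: String -> [String]
preferences "html" = ["text/html", "text/plain"]
preferences "latex" = ["text/latex", "text/plain"]
preferences _ = ["text/plain"]

-- | Pick the most preferred alternative that is actually present;
-- Nothing if none of the preferred MIME types is available.
bestAlternative :: String -> Alternatives -> Maybe (String, String)
bestAlternative output alts =
  listToMaybe (mapMaybe pick (preferences output))
  where
    pick mime = fmap (\body -> (mime, body)) (lookup mime alts)
```

For a cell offering text/html and text/plain, this selects the HTML version when the output is html and falls back to the plain version when the output is latex.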
A few thoughts on a new Format type.
Having a Formats algebra to specify ranges of formats like in the stalled pull request is a good idea, as is having something like IfFormatBlock and IfFormatInline constructs (from this comment). I don't think the If* constructors remove the need for the Format in the Raw* constructors, though, based on current usage. In Writers.Markdown, as an example, the format of a RawBlock influences how it's rendered, not just whether or not it's rendered.
One outline of a design is to include something like this in pandoc-types:
module Text.Pandoc.Format where

-- Absolutely anything that might occur in Format right now is included. Requires a look through
-- the pandoc code base to get everything, I think.
data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3 | ...
  deriving (..., Enum, Bounded)

-- The Formats boolean algebra is just the normal one for Set Format.
newtype Formats = Formats (Set Format)

-- As a format specifier or selector, Formats x means "any of the formats in x".
matchesFormat :: Formats -> Format -> Bool
(Formats s) `matchesFormat` f = f `Set.member` s

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

anyFormat :: Formats
anyFormat = anyOf [minBound..maxBound]

notFormat :: Formats -> Formats
notFormat (Formats s) = Formats $ t `Set.difference` s
  where Formats t = anyFormat

-- and various other boolean operations on Formats
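A compilable version of this outline, with two of the "other boolean operations" filled in, might look as follows. The Format constructor list is deliberately a small stand-in, and noFormat, orFormat, and andFormat are names assumed here rather than taken from the proposal.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- A deliberately small stand-in for the full list of formats.
data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3 | LaTeX | Markdown
  deriving (Eq, Ord, Show, Enum, Bounded)

newtype Formats = Formats (Set Format)
  deriving (Eq, Show)

-- | Formats x means "any of the formats in x".
matchesFormat :: Formats -> Format -> Bool
matchesFormat (Formats s) f = f `Set.member` s

anyOf :: [Format] -> Formats
anyOf = Formats . Set.fromList

-- | Top and bottom of the algebra.
anyFormat, noFormat :: Formats
anyFormat = anyOf [minBound .. maxBound]
noFormat = Formats Set.empty

-- | Complement relative to the set of all formats.
notFormat :: Formats -> Formats
notFormat (Formats s) = Formats (t `Set.difference` s)
  where
    Formats t = anyFormat

-- | Disjunction and conjunction are set union and intersection.
orFormat, andFormat :: Formats -> Formats -> Formats
orFormat (Formats a) (Formats b) = Formats (a `Set.union` b)
andFormat (Formats a) (Formats b) = Formats (a `Set.intersection` b)
```

The selector NOT(HTML OR Markdown) mentioned earlier in the thread then becomes notFormat (anyOf [HTML, Markdown]), which matches LaTeX but not HTML.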
The Format type supports a sub-format relation, where x is a sub-format of y if a raw element of format x can always be included in an output format y. This (with helper functions) should make it easier to figure out when IfFormat* and Raw* elements should be rendered. The two functions below should represent that relation, the actual definitions requiring a look through pandoc to make sure they're accurate.
-- List the sub-formats of the given format
includesFormats :: Format -> Formats
includesFormats HTML  = anyOf [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includesFormats HTML5 = anyOf [HTML5, EPUB3]
includesFormats EPUB  = anyOf [EPUB, EPUB2, EPUB3]
-- etc.

-- List the super-formats of the given format
includedByFormats :: Format -> Formats
includedByFormats HTML  = anyOf [HTML]
includedByFormats HTML5 = anyOf [HTML, HTML5]
includedByFormats EPUB  = anyOf [HTML, EPUB]
-- etc.
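Since the two tables are inverses of each other, one option is to write only includesFormats by hand and derive the other direction. The sketch below does that over plain Set Format values; the entries come from the partial table above, the catch-all clause (a format includes only itself) is an assumption, and isSubFormatOf is a hypothetical helper.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3 | LaTeX
  deriving (Eq, Ord, Show, Enum, Bounded)

-- | Sub-formats of a format, following the partial table in the thread.
-- Assumption: formats not listed there include only themselves.
includesFormats :: Format -> Set Format
includesFormats HTML = Set.fromList [HTML, HTML4, HTML5, EPUB, EPUB2, EPUB3]
includesFormats HTML5 = Set.fromList [HTML5, EPUB3]
includesFormats EPUB = Set.fromList [EPUB, EPUB2, EPUB3]
includesFormats f = Set.singleton f

-- | Super-formats, derived by inverting 'includesFormats' so that the
-- two directions cannot drift apart.
includedByFormats :: Format -> Set Format
includedByFormats f =
  Set.fromList [g | g <- [minBound .. maxBound], f `Set.member` includesFormats g]

-- | x is a sub-format of y according to the table above.
isSubFormatOf :: Format -> Format -> Bool
x `isSubFormatOf` y = x `Set.member` includesFormats y
```

The derived includedByFormats reproduces the three hand-written entries above for HTML, HTML5, and EPUB, which is a cheap consistency check on the table.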
It would be simpler to have only concrete, fully-specified formats in Format (and maybe consolidate formats that are indistinguishable from each other), but that would complicate things for Writers.Markdown, which needs to be able to render a Format when writing a RawBlock or RawInline. That also means that Format can't easily be replaced by Formats in those constructors.
Having a "big" Format type should at least allow it to be used in places where Text is used currently, like reader specification in Reader.readers, or default extension selection in Extensions.
Currently, Format is used only by the writers to figure out how to render a RawBlock and RawInline. I have noticed a couple of things in pandoc that have implications for the sub-format relation:
- markdown* formats are equivalent to each other in the sub-format sense, in that any raw element in one markdown* format can always be included in the output for any other. The only way they differ seems to be in choosing default extensions (and that happens via a Text string, not a Format).
- commonmark*, epub*, slideous, and so on, are not related to any other format in the sub-format sense, even themselves: they are never included in any output at all.
If Format is to be used in more places, it might be helpful also to have a
toConcreteFormat :: Format -> Format
toConcreteFormat HTML = HTML5
toConcreteFormat HTML5 = HTML5
toConcreteFormat EPUB = EPUB3
-- etc.
that takes under-specified formats and chooses a default concrete one for them, like the --to option currently does.
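A small runnable variant of toConcreteFormat, with a property one would probably want from it: applying it twice changes nothing. The constructor list is a stand-in, the catch-all clause assumes every unlisted format is already concrete, and isConcrete and concreteIsIdempotent are hypothetical helpers.

```haskell
data Format = HTML | HTML4 | HTML5 | EPUB | EPUB2 | EPUB3
  deriving (Eq, Show, Enum, Bounded)

-- | Map under-specified formats to a default concrete one.
-- Assumption: anything not listed is already concrete.
toConcreteFormat :: Format -> Format
toConcreteFormat HTML = HTML5
toConcreteFormat EPUB = EPUB3
toConcreteFormat f = f

-- | A format is concrete if resolving it changes nothing.
isConcrete :: Format -> Bool
isConcrete f = toConcreteFormat f == f

-- | Resolving twice must give the same result as resolving once.
concreteIsIdempotent :: Bool
concreteIsIdempotent =
  all
    (\f -> toConcreteFormat (toConcreteFormat f) == toConcreteFormat f)
    [minBound .. maxBound]
```

Checking the property over [minBound .. maxBound] guards against a default that itself points at another under-specified format.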
Maybe a better way to define a sub-format is to say that x is a sub-format of y if whenever a raw element of format y can be included somewhere, a raw element of format x can be included in the same place and in the same way.
I have the feeling there are a few different "sub-format" relations..
- RawInline and RawBlock AST elements, which enable you to include raw snippets of format X when doing -t X, but also include tex in markdown for example (while other writers would drop raw tex)
- -t markdown_phpextra
- pandoc -t html, it's a synonym for -t html5
- -t epub, there is an epub writer, which however calls the html writer
- -t pdf uses either latex or html writer
- -t odt basically zips what -t opendocument would produce AFAIK
Yes, I think there are a few relevant relations. There are:
- Raw* one, so that writers can test how a Raw* element should be included, if at all
- IfFormatBlock and IfFormatInline one, once they exist, so that conditional rendering happens properly
- --to one, where some formats are aliases for other formats
I think jgm/pandoc-types#78 deals with the first two. The -t one can be solved by making sure Writers.writers is kept up-to-date, and maybe writing a toConcreteFormat :: Format -> Format function.
I think the writers using other writers as intermediates sorts itself out naturally from the perspective of the first two relations, based on the current pandoc behaviour. Right now it's stated in the manual that raw blocks need to use an html* format to be included in epub* output, and that the format to be included in -t pdf is whatever the engine is, so I think there's an expectation that the format used to render the document initially won't be the same as the final output format.
The formats representing different extensions problem should also hopefully be solved in that pull request, for instance by considering all the markdown* formats to be sub-formats of each other.
Format is a synonym for String. Users have to look at the source code to find out the right values for this type. (It can be "html" or "Html" or "latex" or "LaTeX" or "tex"). It's not clear which from the docs alone. Maybe it's better to define a new data type: