jgm / pandoc

Universal markup converter
https://pandoc.org
Other
34.27k stars 3.36k forks source link

Archictectural change: Make all writers useable without IO #2930

Closed jgm closed 7 years ago

jgm commented 8 years ago

Currently a number of writers are impure and do IO: docx, odt, epub, epub3, fb2, icml, rtf.

It would be nice to use a free monad instead, so that these writers could be used either outside of IO contexts. First step would be to catalog the places where IO is really used in these writers.

jgm commented 7 years ago

For the reference.docx, did we think it should go into WriterOptions anyway? (Both the user's reference.docx and the "default" reference.docx, which we need too.)

jkr commented 7 years ago

I'd like to avoid having two separate versions of the writer, a pure and an impure. Instead, we should have a single writer parameterized to the PandocMonad typeclass or whatever it gets called.

Right, that's what we have right now in the writers in my typeclass branch, with precisely the type signature you describe:

writeDocx :: PandocMonad m => WriterOptions -> Pandoc -> m ByteString

What I was describing was what to do with the output of the pure one. Really, in either version -- the free or the typeclass -- you're going to either have to introduce the state or have a wrapper (hence the writePure function above) that hides it from the user.

In the example you mention above:

midnightDocx = setTime midnight >> readMarkdown opts input >>= writeDocx opts

You'd still end up with a State Whatever ByteString output from writeDocx. And this would also be the case after running runPure with free. We could have Data.Default and call it def, but it's going to have to come in somewhere. I don't think either approach does any more than the other in letting us avoid that.

jkr commented 7 years ago

For the reference.docx, did we think it should go into WriterOptions anyway? (Both the user's reference.docx and the "default" reference.docx, which we need too.)

I think so -- only question is whether it should go in as a strict ByteString or an Archive. My preference would be the BS, since it would require the user to have one less library loaded.

jkr commented 7 years ago

But it is a bit of a departure to put such an opaque blob in there.

jgm commented 7 years ago

Yes, sorry, my original reply was based on a misunderstanding.

I think having something that hides the state from the user is a good idea. That's one nice thing about the Free Monad approach -- we can use state behind the scenes in the interpreter without it ever even being exposed to the user. Here, we need to expose it.

Or do we? We could make it opaque behind a newtype.

We'd still need to provide a runPure type function to extract real content out of this newtype.

Maybe that's the best approach?

+++ Jesse Rosenthal [Nov 21 16 08:27 ]:

I'd like to avoid having two separate versions of the writer, a pure
and an impure. Instead, we should have a single writer parameterized
to the PandocMonad typeclass or whatever it gets called.

Right, that's what we have right now in the writers in my typeclass branch, with precisely the type signature you describe: writeDocx :: PandocMonad m => WriterOptions -> Pandoc -> m ByteString

What I was describing was what to do with the output of the pure one. Really, in either version -- the free or the typeclass -- you're going to either have to introduce the state or have a wrapper (hence the writePure function above) that hides it from the user.

In the example you mention above: midnightDocx = setTime midnight >> readMarkdown opts input >>= writeDocx opts

You'd still end up with a State Whatever ByteString output from writeDocx. And this would be the case after running runPure with free. We could have Data.Default and call it def, but it's going to have to come in somewhere. I don't think either approach does any more than the other in letting us avoid that.

— You are receiving this because you were mentioned. Reply to this email directly, [1]view it on GitHub, or [2]mute the thread.

References

  1. https://github.com/jgm/pandoc/issues/2930#issuecomment-261988559
  2. https://github.com/notifications/unsubscribe-auth/AAAL5PluHZsH4ZUjYHFtFSaXC2CoGZr4ks5rAcaBgaJpZM4Iiem3
jgm commented 7 years ago

+++ Jesse Rosenthal [Nov 21 16 08:30 ]:

For the reference.docx, did we think it should go into WriterOptions
anyway?
(Both the user's reference.docx and the "default" reference.docx,
which we need too.)

I think so -- only question is whether it should go in as a strict ByteString or an Archive. My preference would be the BS, since it would require the user to have one less library loaded.

Agreed.

jkr commented 7 years ago

That's one nice thing about the Free Monad approach -- we can use state behind the scenes in the interpreter without it ever even being exposed to the user. Here, we need to expose it.

I think we can do exactly the same thing with typeclasses, perhaps a bit more simply. First, just to make sure we're on the same page: in the Free version I have up right now, we have this:

runIO :: PandoAction nxt -> IO nxt

runTest :: PandocAction nxt -> Testing nxt

where Testing is a ReaderT State monad. Are you describing the following:

runIO :: PandoAction nxt -> IO nxt

runTest' :: PandocAction nxt -> Testing nxt

-- free monad version

runPure :: PandocAction nxt -> nxt
runPure x = runState (runReaderT (runTest' x) def) def

and then the user could do:

runPure $ do
  setTime midnight
  setReferenceDocx someByteString
  writeDocx opts docx

This, I think, is the best we could do as far as hiding state and still letting the user set state.

But, like I said, I think the typeclass approach lets us do basically the same thing. We don't have runIO anymore, because that's implied by setting PandocMonad to IO. But for the pure version, we can just have

-- typeclass version

runPure :: Testing ByteString -> ByteString
runPure tbs = evalState (runReaderT tbs def) def

foo = runPure $ do
  setTime "midnight"
  setReferenceDocx someBlob
  writeDocx opts docx

Note that we don't need to have the subordinate runTest' (or whatever it would be called). And because of how runPure is defined, the compiler would infer that it should output the pure state version from writeDocx.

I'm not sure this is an argument for the typeclass version, but I don't see how the free monad version allows us to hide the details any more. In fact, it seems slightly clumsier, because of the different explicit interpreter we'd need to run.

jkr commented 7 years ago

Sorry, for more generality, runPure in the second example should be Testing a -> a

jgm commented 7 years ago

That's basically what I had in mind, but on reflection I think we might want to consider something a bit fancier:

newtype PandocPure a = PandocPure (State PandocPureState a)
newtype PandocIO a = PandocIO (StateT PandocIOState IO a)

runPure :: PandocPure a -> a
runIO :: PandocIO a -> IO a

Here PandocPureState would include all the fake real-world stuff we need, but also reader and writer options (which could therefore be set within the monad by the user) and accumulated warning messages. PandocIOState would be similar but without fake real-world stuff.

So a user could do something like:

convert :: String -> ([String], String)
convert inp = runPure $ do
  setReaderOption readerSmart True
  setWriterOption writerColumns 50
  doc <- readMarkdown inp
  let doc' = walk emphToCaps doc
  out <- writeDocBook doc
  warnings <- getWarnings
  return (warnings, out)
jgm commented 7 years ago

On this approach:

Note that in the above example (convert) you could replace runPure with runIO and leave everything else the same; this would make the type String -> IO ([String], String).

As for warnings, maybe we'd need some kind of switch to allow warnings to be emitted in a streaming fashion for the IO case, instead of collecting them all. But giving the option to collect them (even in IO) seems desirable; consider a web application, for example, that would want to display the warnings on a web page rather than having them go to stderr.

jgm commented 7 years ago

@jkr I put up a typeclass branch at jgm/pandoc -- rebased onto master from your typeclass branch.

jkr commented 7 years ago

So PandocPure and PandocIO would be the instances? And given the convert function above, I take it that we could imagine all readers and writers having the types

readFoo :: PandocMonad m => ByteString -> m Pandoc

writeBar :: PandocMonad m => Pandoc -> m ByteString

(not getting into the technicalities of whether/how we want to unify our string types, though that might be a conversation we want to have). Just wanted to make sure I've got this.

Note that in the above example (convert) you could replace runPure with runIO and leave everything else the same; this would make the type String -> IO ([String], String).

I take it we would do runIO in the binary for all cases? But how would users (of the library) know if they have to, for their particular reader and writer? Right now, for example, we'd get the same thing with runIO . readMarkdown and return . runPure . readMarkdown. But not with, say, readT2T, which has macro expansion. I guess that's what documentation is for.

jgm commented 7 years ago

+++ Jesse Rosenthal [Nov 21 16 15:57 ]:

So PandocPure and PandocIO would be the instances? And given the convert function above, I take it that we could imagine all readers and writers having the types

readFoo :: PandocMonad m => ByteString -> m Pandoc

writeBar :: PandocMonad m => Pandoc -> m ByteString

(not getting into the technicalities of whether/how we want to unify our string types, though that might be a conversation we want to have).

That was the idea -- both readers and writers would be parameterized in PandocMonad. The string type is another issue. We could unify everything to ByteString, I suppose, but that doesn't feel like the right thing to do with the textual ones...

Note that in the above example (convert) you could replace runPure
with runIO and leave everything else the same; this would make the
type String -> IO ([String], String).

I take it we would do runIO in the binary for all cases?

Yes.

But how would users (of the library) know if they have to, for their particular reader and writer? Right now, for example, we'd get the same thing with runIO . readMarkdown and return . runPure . readMarkdown. But not with, say, readT2T, which has macro expansion. I guess that's what documentation is for.

Yes, that's what I had in mind.

Also, we need to think about error handling. We don't want any 'error' bombs in this code, especially if we want everything to be useable "pure." You said you were investigating making these instances of MonadError? It would be nice to have some uniform method of throwing errors that behaved sensibly in both monads.

This interacts with things like include files. Say we run a reader in PandocPure and it sees "\include{myfile}". And let's assume that there's nothing in the "fake real world" representing the contents of myfile. I guess we throw an error, right? So the user of the library would soon find out that the reader only works in "pure" when a fake include file is provided?

jkr commented 7 years ago

Also, we need to think about error handling. We don't want any 'error' bombs in this code, especially if we want everything to be useable "pure." You said you were investigating making these instances of MonadError? It would be nice to have some uniform method of throwing errors that behaved sensibly in both monads.

Yes, the difficulty before had to do with the specifics of the Free monad. It should be pretty easy in the typeclass version. For the IO version, we get MonadError for free from the IO. And for the pure version we just have to add an ExceptT to the stack (I have to refresh my memory on the pros and cons of doing it on the inside vs. the outside). Or we could an EitherT to both stacks.

Then so long as we mandate that all of the readers/writers only use the P.namespace functions, we can handle errors in a sensible way. If we wanted to we could even rewrite the IO functions so they would be safe:

instance PandocMonad PandocIO where
   readFileLazy = (liftIO . BL.readFile) E.catch (doSomethingSafe)
   ...
jkr commented 7 years ago

This interacts with things like include files. Say we run a reader in PandocPure and it sees "\include{myfile}". And let's assume that there's nothing in the "fake real world" representing the contents of myfile. I guess we throw an error, right?

Well, I guess it should mirror what we do in IO. But if we have safe versions we could catch errors in IO as well. At the command line, the caught errors would exit the program and give an error message. But in the programmatic version, even in IO, it would give the user a Left PandocError (or whatever) that the user could decide what to do with.

In other words, from an executable standpoint, it would look the same. But in the library, both IO and pure would catch errors in the same way. So to go back to our previous signatures, something like:

runPure :: PandocMonad a -> Either PandocError a
runIO :: PandocMonad a -> IO (Either PandocError a)

[EDITED JGM: s/PandocPure/PandocMonad/; s/PandocIO/PandocMonad/]

jgm commented 7 years ago

wanted to we could even rewrite the IO functions so they would be safe:

Yes, good thought. That's probably worth doing. We'd need to give some thought to the PandocError type, e.g. giving it a constructor that can store IO errors and the like, so no information gets lost.

jkr commented 7 years ago

@jgm: I just pushed an update to Text.Pandoc.Class that introduces the opaque newtypes PandocIO and PandocPure. PandocIO is MonadIO. They're both instance of MonadError, so I've started implementing some of the safe functions above, which should work with throwError. See for example the implementation of readFileLazy in PandocIO. To avoid compatibility problems, fetchItem and fetchItem' still have an Either output, but that's redundant now with the MonadError instance.

Which is all to say that runIO and runPure have signatures m a -> Either PandocExecutionError a. Right now, PandocExecutionError only has one form (PandocFileReadError String), but, of course, sky's the limit there.

For the sake of compatibility, I'm also exporting runIOorExplode, which is just runIO, but it errors out if it hits a Left. So that's what's in Text.Pandoc.Pandoc now.

jkr commented 7 years ago

Which is all to say that runIO and runPure have signatures m a -> Either PandocExecutionError a

Sorry, misspoke. runIO has type PandocIO a -> IO (Either PandocExecutionError a)

And, of course, happy Thanksgiving.

jgm commented 7 years ago

@jkr - sounds great.

jkr commented 7 years ago

You might have to take a look into Custom to see what would be required All the Lua functions seem to be IO, but I'm not sure I understand the interop enough to see the best way to wrap that (and to purify it).

jkr commented 7 years ago

By the way (and this might have been clear to you from the beginning) one nice side effect of having PandocMonad be an instance on MonadError this is that all readers and writers will be able to run throwError in a unified and pure way, without needing to build it into each reader/writer's monad stack. I'm currently going through and changing all writers to PandocMonad m => WriterOptions -> Pandoc -> m String, which is currently just tacking a return $ on the toplevel function. But it does mean in the future, we can throw errors cleanly. When we do runIO we can have them self-destruct (or not) but when we do runPure, they'll just go into the Left.

jkr commented 7 years ago

The more I look at it, the more I think the Custom writer doesn't really fit in with the proposed new architecture. Since it's necessarily IO, we can't pretend, as we currently do, that it's a normal writer (currently when it's called, at the toplevel, we pretend it's an IOStringWriter, but that sort of always-IO type won't exist anymore). Suggestion: treat it at the toplevel as if it were JSON output + IO post-processing. In other words, internally, we'd treat

pandoc foo.md -t custom/script.lua

as if it were

pandoc foo.md -t json | custom/script.lua

where we understand the internal lua runner to convert json first.

This might not be a big difference (other than a minor slowdown) but it does keep us from trying to force the Custom writer into a parameterized box, which might make the code a bit less hacky.

jgm commented 7 years ago

+++ Jesse Rosenthal [Nov 25 16 19:36 ]:

By the way (and this might have been clear to you from the beginning) one nice side effect of having PandocMonad be an instance on MonadError this is that all readers and writers will be able to run throwError in a unified and pure way, without needing to build it into each reader/writer's monad stack.

Yes, that's going to be super-helpful.

In addition, all the readers/writers will be able to issue warnings. I just added warnings for unconvertible math to the docx writer, for example -- and we could add them to every writer with this change.

jgm commented 7 years ago

It seems rather silly to convert to JSON just to convert back again to Pandoc. Note that custom writers aren't in the list of writers exported by Text.Pandoc. Custom writers are treated specially in pandoc.hs.

So I don't think they pose any special problem, we can just leave them in IO, can't we? They won't have the same type as other writers, but I think that's okay because they're already treated specially.

+++ Jesse Rosenthal [Nov 26 16 07:31 ]:

The more I look at it, the more I think the Custom writer doesn't really fit in with the proposed new architecture. Since it's necessarily IO, we can't pretend, as we currently do, that it's a normal writer (currently when it's called, at the toplevel, we pretend it's an IOStringWriter, but that sort of always-IO type won't exist anymore). Suggestion: treat it at the toplevel as if it were JSON output + IO post-processing. In other words, internally, we'd treat pandoc foo.md -t custom/script.lua

as if it were pandoc foo.md -t json | custom/script.lua

where we understand the internal lua runner to convert json first.

This might not be a big difference (other than a minor slowdown) but it does keep us from trying to force the Custom writer into a parameterized box, which might make the code a bit less hacky.

— You are receiving this because you were mentioned. Reply to this email directly, [1]view it on GitHub, or [2]mute the thread.

References

  1. https://github.com/jgm/pandoc/issues/2930#issuecomment-263069405
  2. https://github.com/notifications/unsubscribe-auth/AAAL5DMpYBcRcrAfl3PHTbFt7AhSRW3Aks5rCFDvgaJpZM4Iiem3
ickc commented 7 years ago

Is the free monad approach abandoned in favor of the typeclass approach? Sorry to ask this, I saw a lot of commits over there but no discussion here.

jgm commented 7 years ago

+++ ickc [Dec 06 16 06:44 ]:

Is the free monad approach abandoned in favor of the typeclass approach? Sorry to ask this, I saw a lot of commits over there but no discussion here.

Yes. We decided that the typeclass approach would be much easier to implement, and that the free monad approach didn't seem to have any real advantages over it (for our purposes).

mpickering commented 7 years ago

I skimmed this discussion and it was interesting. I think you're right that ultimately using a type class is easier and in fact equivalently powerful. After all you can write an instance for the the type class which interprets each operation into the free monad and likewise an interpreter for the free monad which interprets each constructor as the type class method.

jgm commented 7 years ago

+++ Matthew Pickering [Dec 11 16 13:29 ]:

. After all you can write an instance for the the type class which interprets each operation into the free monad and likewise an interpreter for the free monad which interprets each constructor as the type class method.

Good point!

jgm commented 7 years ago

Since the typeclass branch has been merged, we can close this. There are still going to be changes to the details.