manubot / rootstock

Clone me to create your Manubot manuscript
https://manubot.github.io/rootstock/

Journal compatibility #4

Open agitter opened 7 years ago

agitter commented 7 years ago

I'm excited to see this standalone manuscript repository!

I have a general question regarding journal submissions. Many journals require Word or LaTeX formats for submission. Have you thought about how manuscripts written in this markdown format can be submitted to a journal with those requirements? Would one use pandoc outside of the automatic build to do a one-time conversion to Word or LaTeX?

dhimmel commented 7 years ago

You should be able to adapt our pandoc commands to convert to Word or LaTeX. There's a chance some things will break... Perhaps try --to=docx or --to=latex in the deep review pandoc command.
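
A minimal sketch of such a one-off conversion, assuming pandoc is installed and on the PATH. The helper names and file names are placeholders for illustration, and the actual build command will likely carry further options that are not reproduced here.

```python
import shutil
import subprocess


def build_pandoc_command(source, output, to_format):
    """Return a pandoc argument list for a one-off conversion."""
    if to_format not in {"docx", "latex"}:
        raise ValueError(f"unsupported format: {to_format}")
    return [
        "pandoc",
        "--from=markdown",
        f"--to={to_format}",
        f"--output={output}",
        source,
    ]


def convert(source, output, to_format):
    """Run the conversion; raises if pandoc is missing or exits non-zero."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on the PATH")
    subprocess.run(build_pandoc_command(source, output, to_format), check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command it would run.
    print(" ".join(build_pandoc_command("manuscript.md", "manuscript.docx", "docx")))
```

Calling convert("manuscript.md", "manuscript.tex", "latex") would attempt the LaTeX variant in the same way.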

One other option is to copy the rich text from the rendered HTML and paste that into a word processor. This works surprisingly well, despite being a sort of hack.

slochower commented 7 years ago

This is a good point and one I've run into personally a few times. You can definitely get a Word .docx out of pandoc. Problems crop up when a journal insists that authors use a Word template or in-house styles (e.g., Science). In my experience, pandoc has partial functionality for many cases, but it gets ugly and I'm not sure there is any way around it besides manual tinkering.

An alternative approach by the R folks has been to specifically craft R Markdown templates for specific journals: https://github.com/rstudio/rticles, but I think that approach deviates from the Swiss army knife features of pandoc, is a lot of effort, and will become out of date when the publishers change something.

evancofer commented 7 years ago

With the current approach, the LaTeX output from Pandoc includes the authors and their affiliations as text in the document body rather than using the built-in "author" field for this. The same issue applies to the title, which is instead inserted as a section. Subsequent sections are inserted as subsections. Lastly, the superscript fields for author affiliations (i.e., <sup>) do not translate from HTML5 to LaTeX. If this is something you have an interest in addressing, I can contribute a PR for it.

dhimmel commented 7 years ago

@evancofer one of the main motivations of this project is to be a next-generation manuscript creation system that frees itself from the limitations of LaTeX. So I'm hesitant to start modifying the core implementation for LaTeX compatibility. On the other hand, it's nice to be able to convert from the markdown source to any format that has utility for you...

What's the specific problem you're trying to solve? If it's submission to a journal, is the docx export (or copy and RTF paste) not an adequate solution? Going forward I see more journals accepting PDF initial submissions and then markdown or perhaps JATS XML for final submission.

If this is something you have an interest in addressing, I can contribute a PR for it.

What would be most compelling is a way of defining the title and authors that works for all output formats. Is there a pandoc standard for how this information should be passed as input? This would allow proper LaTeX export as well as any other format.

evancofer commented 7 years ago

I have investigated methods to define authors in pandoc. Authors and the title must be incorporated as metadata (see here for what I think might be a solution). This seems to be the recommended approach. If this works, it would then be trivial to change top-level sections (e.g., abstract) from ## to # .
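
A rough sketch of that heading change, assuming the manuscript uses ATX-style headings (lines starting with #) and that the title has already moved into metadata. The function name is hypothetical, and lines inside fenced code blocks are left untouched.

```python
import re

# Matches ATX headings of level 2 through 6, e.g. "## Abstract".
HEADING = re.compile(r"^(#{2,6})(\s+\S.*)$")
# Fence markers are built at runtime to keep this listing easy to embed.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def promote_headings(markdown):
    """Remove one '#' from every heading of level 2 or deeper."""
    promoted = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            # Toggle on every fence line so code samples are not rewritten.
            in_fence = not in_fence
        elif not in_fence:
            line = HEADING.sub(lambda m: m.group(1)[1:] + m.group(2), line)
        promoted.append(line)
    return "\n".join(promoted)


if __name__ == "__main__":
    print(promote_headings("## Abstract\n\nText.\n\n### Background"))
```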

The reason <sup> renders in the current format is that the PDF is compiled from HTML5 after the Markdown parsing. Any other conversion method (e.g., from Markdown to a .docx file) will fail to properly render superscripts. I suspect that they are undefined in the Markdown standard.
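
One possible workaround, sketched under the assumption that the source is read as pandoc Markdown with its superscript extension enabled: pandoc writes superscripts as ^text^ and requires spaces inside them to be backslash-escaped. The function name is hypothetical, and nested tags are not handled.

```python
import re

# Non-greedy match so that several <sup> tags on one line stay separate.
SUP_TAG = re.compile(r"<sup>(.*?)</sup>", flags=re.IGNORECASE | re.DOTALL)


def sup_to_pandoc(markdown):
    """Rewrite <sup>x</sup> as ^x^, escaping spaces inside the superscript."""

    def replace(match):
        inner = match.group(1).replace(" ", "\\ ")
        return f"^{inner}^"

    return SUP_TAG.sub(replace, markdown)


if __name__ == "__main__":
    print(sup_to_pandoc("Jane Roe<sup>1,2</sup>, John Doe<sup>2</sup>"))
```

Because the result is ordinary pandoc Markdown, the same source should then carry superscripts into docx and LaTeX output as well as HTML.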

It is troubling that seemingly essential aspects of lighter formatting (e.g., superscripts) and heavier formatting (e.g., math) are not part of Markdown's core features. This is concerning if this is to be the next-generation manuscript system. For one, it means that changes in various extensions can greatly alter how a Markdown document might be rendered, which is an unnecessary dependency management issue. With regard to .docx files, many venues do not accept them at all (e.g., arXiv, ICML). Conversely, nearly all accept LaTeX files. Part of this is due to .docx's rendering inconsistency across platforms, but part is that LaTeX can be directly transformed into PostScript. PostScript (i.e., ".ps" files) is a programming language and the industry-standard format for publishing and printing. It has been around for over 30 years. For several reasons (e.g., poor compression and quality, limited functionality, and security risks) few printers and publishers use PDF as their primary document format. Though it is possible to convert from PDF to PostScript, it is very lossy; after all, a PDF is just PostScript that has been executed and rendered. Given all of these issues, it may be worth supporting LaTeX (or PostScript if you are not a fan of LaTeX) in addition to the other formats.

slochower commented 7 years ago

@evancofer I mostly agree, although...

Conversely, nearly all accept LaTeX files

has not been the case in my experience, unfortunately. Many of the biologically-focused journals that I've encountered are reluctant to accept LaTeX source and when they do, they have a bunch of restrictions. But in those cases, I haven't had too many issues uploading PDF.

It is troubling that seemingly-essential aspects of lighter formatting (e.g., superscripts) and heavier formatting (e.g., math) are not part of Markdown's core features.

I think this goes back to the origin of Markdown around 2004 (the author of pandoc, John MacFarlane, also worked on the CommonMark standard), and I agree with the rest of what you said.

dhimmel commented 7 years ago

@evancofer nice find with the pandoc_title_block extension. I think we should populate all three fields: title, authors, date. Authors and date would be inserted by the build system using jinja2. One big advantage is that pandoc would presumably transmit this information as HTML/PDF metadata. This would help with citing our outputs via Greycite (which currently does not work well). If you want to proceed here with a PR, I can create the authors.tsv discussed in #7.
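
A rough sketch of how the build could generate such a title block, assuming pandoc's documented layout of three lines starting with % (title, authors separated by semicolons, date) at the very top of the document. The function names and sample values are placeholders.

```python
def pandoc_title_block(title, authors, date):
    """Return a pandoc_title_block: '% title', '% authors', '% date'."""
    lines = [
        f"% {title}",
        "% " + "; ".join(authors),  # pandoc separates authors with semicolons
        f"% {date}",
    ]
    # A blank line separates the title block from the manuscript body.
    return "\n".join(lines) + "\n\n"


def prepend_title_block(body, title, authors, date):
    """Place the title block at the very start, where pandoc expects it."""
    return pandoc_title_block(title, authors, date) + body


if __name__ == "__main__":
    print(prepend_title_block(
        "## Abstract\n\nText.\n",
        "The document title",
        ["Author One", "Author Two"],
        "2017-06-01",
    ))
```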

The reason <sup> is rendering in the current format is because of the compilation to a PDF from HTML5 that occurs after the Markdown parsing.

I'm not a fan of including the superscripts in the pandoc_title_block. You could always use Unicode superscript digits (⁰¹²³⁴⁵⁶⁷⁸⁹), but I think this is wrong since it would show up in the article author metadata. It looks like we can also use a yaml_metadata_block. The docs provide the relevant example:

---
title: The document title
author:
- name: Author One
  affiliation: University of Somewhere
- name: Author Two
  affiliation: University of Nowhere
...

And it looks like we could also use pandoc variable templating rather than jinja2?

So right now, I'm less certain about what to do than when this issue started. I think we do want to fill out the pandoc metadata (title, author, date). Given that, we have to decide how we want to show author affiliations and other author info (such as email or contributed equally symbols) as discussed in #7. In addition, we should consider writing our variables to a yaml_metadata_block and then removing jinja2 entirely (which would help @slochower with #8).
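
A minimal sketch of writing those variables to a yaml_metadata_block without extra dependencies. It assumes each author is a dict with "name" and "affiliation" keys (a hypothetical layout, not a settled one), and it quotes values with json.dumps on the assumption that JSON double-quoted strings are also valid YAML scalars.

```python
import json


def yaml_metadata_block(title, authors, date):
    """Return a pandoc yaml_metadata_block for title, authors and date."""

    def quote(text):
        # Double-quoted JSON strings avoid problems with ':' or '#' in values.
        return json.dumps(text, ensure_ascii=False)

    lines = ["---", f"title: {quote(title)}", "author:"]
    for author in authors:
        lines.append(f"- name: {quote(author['name'])}")
        lines.append(f"  affiliation: {quote(author['affiliation'])}")
    lines.append(f"date: {quote(date)}")
    lines.append("...")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(yaml_metadata_block(
        "The document title",
        [{"name": "Author One", "affiliation": "University of Somewhere"}],
        "2017-06-01",
    ))
```

Keeping affiliations as structured fields, rather than superscripts in the author string, leaves the choice of how to display them to each output template.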

It is troubling that seemingly-essential aspects of lighter formatting (e.g., superscripts) and heavier formatting (e.g., math) are not part of Markdown's core features.

Yes, it's not ideal, but I still think markdown is preferable to asciidoc for its ease of use. I think standards for markdown extensions have begun to emerge, and we'll respect those as much as possible (e.g. #2).

We've sort of veered off course from the initial purpose of this issue, but have brought up some important points. To keep things moving, I suggest the following stance: HTML and PDF output are the only supported outputs that will work without any modifications. For other output types, such as docx and LaTeX, we will address incompatibilities on a case-by-case basis.

evancofer commented 7 years ago

@dhimmel Sounds good. I will get working on the PR. It should be up early this week (i.e., before Wednesday).

dhimmel commented 7 years ago

I will get working on the PR. It should be up early this week (i.e., before Wednesday).

Awesome. I'd suggest opening the PR while it's still a work in progress, so we can get design discussion going early on.