ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

set_textType and friends need better markdown support #298

Open cboettig opened 4 years ago

cboettig commented 4 years ago

I don't think we have a very good way for users to supply markdown input. (The current examples showing support for markdown files still use the 2.1.1 strategy of converting to docbook via pandoc, which is error-prone as a EML doesn't support all docbook terms).

We can get valid 2.2.0 markdown 'manually', e.g. something like:

methods <- list(methodStep = 
                           list(description =
                                 list(markdown = paste(readLines("methods.md"), collapse = "\n"))))

but it would be much better to support something closer to the current sytnax in the example, e.g. maybe:

abstract <- set_TextType(markdown = "abstract.md")
methods <- set_methods(markdown = "methods.md")

(Functions would also need a check to make sure we were in 2.2.0 mode).

Anyone generating EML with markdown sections currently? Do you do something like the above? Any takers on a PR on this one?

amoeba commented 4 years ago

Good catch. Agree on the current implementation being a bit error-prone. EML 2.2.0 got Markdown support in part because of its popularity and ubiquity but also because the pre-2.2.0 TextType module was really hard to use. I think defaulting to a Markdown-first experience from here on out is a good idea.

Another API it'd be nice to support would be

list(methodStep = list( # or set_methods(markdown = ) instead
  list(description = list(
    markdown = "# Header

Example text.
"))))

so people can keep everything in the same R/Rmd.

@jeanetteclark are you or others on the Data Team using markdown for your TextType elements? @clnsmth, @earnaud?

jeanetteclark commented 4 years ago

@amoeba not usually

clnsmth commented 4 years ago

Thanks for reaching out @amoeba @cboettig. I'm mostly using .docx because that's the preferred file type our data contributors submit their abstract and methods to us. Sadly, I'm using it to remove most of the text formatting before sending it to set_TextType() to bypass the time trouble shooting .docx > pandoc > EML translation issues.

I very much welcome improved support for .md > EML (.xml) and many thanks in advance to whoever takes this on!

cboettig commented 4 years ago

@clnsmth Thanks! that's a great point. I think we should probably replace the existing mechanism which uses pandoc to convert docx and .md into docbook (since too much of modern docbook is just not EML compatible anyway). We can have it instead use pandoc to render .docx files into markdown, and pass-through .md markdown (at least so long as we are in eml-2.2.0 mode). That way, users can continue to rely on .docx as their interface but we'd avoid the myriad docbook translation issues.

I don't have any good solution for what to do about this in eml-2.1.1 mode though where we can't generate <markdown> sections. We could either leave it as is, or perhaps convert to markdown but encode it if it were just a plain text block. This would at least avoid the validation errors. (and we could encourage folks to adopt 2.2.0!)

clnsmth commented 4 years ago

Thanks @cboettig. I think I’m on board with this but have a clarifying question.

In the proposed changes, will at least the current level of support for .txt, .docx, .md > .xml be maintained when EML is operating in eml-2.2.0 mode? This is particularly important for tooling I’ve developed for the non-technical user and would be sad to see it go away : (. Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle <makdown> sections.

cboettig commented 4 years ago

Thanks @clnsmth for the feedback, that's good to know. I'm a bit wary of the transforms into docbook xml because they are currently not guaranteed to produce valid EML (basically depending on what kind of markup the user uses), but if they are useful we could certainly make the migration in a non-breaking way instead. It is mostly a matter of figuring out what syntax to use. e.g. perhaps we need a toggle-parameter like:

set_TextType("myfile.docx", as = "markdown")
set_methods("mymethods.docx", as = "markdown")

etc to encode in markdown, and maybe:

set_TextType("myfile.docx", as = "xml")
set_methods("mymethods.docx", as = "xml")

to encode as docbook. we could haggle over whether markdown or xml would be the default. (Making xml the default would obviously be the backwards-compatible option, but I think it also seems to be creating a somewhat unnecessary gotcha for new users).

Thoughts?

Thanks again for the feedback!

cboettig commented 4 years ago

p.s. accidentally posted the above half-way through writing it, so if you're following via email, see updated thread on GitHub. apologies.

amoeba commented 4 years ago

Good point on backwards compatibility, @clnsmth. I think the ideas here are to maintain the ability to bring .txt, .docx, and .md docs into EML via this package but change what happens. The difference is what happens depending on the EML version (emld_db):

  1. emld_db = 2.1.1 or <: Convert txt, md, docx, etc. to DocBook sections. Current behavior.
  2. emld_db = 2.2.0 or >: Convert txt, md, docx, etc. to Markdown and place into <markdown> elements by default.

The current function signature of set_TextType is:

function(file = NULL, text = NULL)

and I think we can get all of this working with a change to

function(file = NULL, text = NULL, markdown = TRUE)

The function will retain the first part of its signature and the added markdown argument controls whether the function makes use of the EML 2.2.0 markdown element, defaulting to TRUE. Here's a confusion matrix to get an idea:

For any file= input, docx, md, or otherwise:

markdown= emld_db Result
TRUE <=2.1.1 DocBook w/ warning
FALSE <=2.1.1 DocBook
TRUE >=2.2.0 <markdown>
FALSE >=2.2.0 DocBook

Note: When using the text argument instead of file, I think you'd still always get the current behavior of setting a simple text type element instead of DocBook-esque or Markdown.

Does this sound workable?

PS: @clnsmth , you said:

Additionally, continued support for .xml when operating in eml-2.2.0 mode supports data repositories and other metadata consumers using EML 2.2.0 but not ready/wanting to handle sections.

Did you mean "continued support for our mini-DocBook, aka section, and para"? If the default behavior of set_TextType("foo") changed such that it output an <markdown by default, would that be a deal-breaker?

clnsmth commented 4 years ago

Thanks @cboettig @amoeba for considering my use case and these clarifying explanations. I'm completely on-board with changing the default behavior to markdown and retaining backwards compatibility through some permutation of arguments while emld_db = 2.2.0.