Converting prior AC documents to Markdown

baskaufs commented 6 years ago

In the 2018-05-02 meeting (https://github.com/tdwg/ac/blob/master/2018-05-02-hangout-notes.pdf), @nielsklazenga and I said we would take a look at converting existing pages to Markdown. (Actually, I think the notes say "wiki pages", but I started with the core normative documents. I think the wiki pages will probably be easier, and they can also be converted with Pandoc.) I've opened a new branch called documentation-conversion (https://github.com/tdwg/ac/tree/documentation-conversion), which contains a new folder called "doc" holding the preliminary work.

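I used Pandoc (https://pandoc.org/index.html) to convert the AC Structure (https://terms.tdwg.org/wiki/Audubon_Core_Structure) and Term List (https://terms.tdwg.org/wiki/Audubon_Core_Term_List) documents from their online HTML form to Markdown. I put some notes at https://github.com/tdwg/ac/blob/documentation-conversion/doc/pandoc-conversion-notes.txt. Pandoc offers several options for Markdown output: its own Pandoc flavor, strict Markdown, and GitHub-flavored Markdown (gfm). The Pandoc flavor renders terribly on GitHub, so I dropped it. I tried both strict and gfm - you can see the results for the term list in strict (https://github.com/tdwg/ac/blob/documentation-conversion/doc/termlist-strict.md), the term list in gfm (https://github.com/tdwg/ac/blob/documentation-conversion/doc/termlist-gfm.md), and the structure document in gfm (https://github.com/tdwg/ac/blob/documentation-conversion/doc/structure-gfm.md). The structure document wasn't too bad. It just needs to have some stuff like the navigation menu chopped out. The term list is more problematic.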
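For anyone repeating the conversion, here is a minimal sketch of how the Pandoc call could be scripted. It is not the exact command I ran (that is in the notes file); the function names `pandoc_command` and `convert` are made up for illustration, and it assumes `pandoc` is installed and on the PATH.

```python
import shutil
import subprocess

# Markdown output formats discussed above: GitHub-flavored, strict, and Pandoc's own.
ALLOWED_FLAVORS = {"gfm", "markdown_strict", "markdown"}


def pandoc_command(source, output_path, flavor="gfm"):
    """Build the argument list for converting an HTML page to Markdown.

    `source` may be a local HTML file or a URL; Pandoc accepts either.
    """
    if flavor not in ALLOWED_FLAVORS:
        raise ValueError(f"unsupported flavor: {flavor}")
    return ["pandoc", "--from", "html", "--to", flavor,
            "--output", output_path, source]


def convert(source, output_path, flavor="gfm"):
    """Run Pandoc, failing loudly if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, output_path, flavor), check=True)
```

Something like `convert("https://terms.tdwg.org/wiki/Audubon_Core_Term_List", "termlist-gfm.md")` would then produce the gfm version, and passing `flavor="markdown_strict"` the strict one. The navigation menu and similar page furniture would still have to be removed by hand afterwards.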

There are two main problems with the term list document. One is that the hyperlinks within the document are all broken. That's fixable, either by relying on automatically generated links to Markdown headings or by explicitly creating HTML anchors that correspond to the existing links (or by a combination of the two). To see an example of fixing by HTML anchors, I manually fixed the link from "Identifier" in the Index by Label section so that it would jump to the dcterms:identifier metadata. Go to https://github.com/tdwg/ac/blob/documentation-conversion/doc/termlist-gfm.md#Identifier to see what happens, and check the Raw view to see the markup.
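To make the two link-repair options concrete, here is a rough sketch. The slug function only approximates how GitHub derives anchors from headings (lowercase, punctuation dropped, spaces turned into hyphens), so treat it as an assumption rather than the real algorithm. The explicit-anchor helper shows one possible markup; what is actually in the branch may differ. All function names are hypothetical.

```python
import re


def approx_heading_slug(heading):
    """Approximate the anchor GitHub generates for a Markdown heading.

    Assumption: lowercase, drop punctuation other than hyphens, and
    replace spaces with hyphens. The real rules have more edge cases.
    """
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # removes characters such as ':'
    return text.replace(" ", "-")


def explicit_anchor(name):
    """Return an HTML anchor that keeps an existing link target unchanged."""
    return f'<a id="{name}"></a>'


def index_link(label, target):
    """Return a Markdown link from an index entry to an in-page target."""
    return f"[{label}](#{target})"
```

With automatic anchors, a heading such as "dcterms:identifier" would come out roughly as `#dctermsidentifier`, so every existing link would need rewriting. With `explicit_anchor("Identifier")` placed before the term, the old `#Identifier` target keeps working and `index_link("Identifier", "Identifier")` is all the index needs.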

The other problem is that the table format that's used in the current Term List document (https://terms.tdwg.org/wiki/Audubon_Core_Term_List) is totally broken in the converted Markdown. I manually turned the dcterms:identifier metadata into a table, which looks OK, but because Markdown tables require a header row, it's a bit goofy. It also had to be formatted entirely by hand, which is not feasible for a document of this size. I was looking for a way to just put a tab after the property name, but that didn't work.
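The hand formatting is the part that software could take over. Below is a minimal sketch of generating a gfm table for one term. The function names are hypothetical, and putting the term name in the mandatory header row is just one way of living with the header requirement, not necessarily what the manually built table does.

```python
def escape_cell(value):
    """Escape characters that would break a gfm table cell."""
    return str(value).replace("|", "\\|").replace("\n", " ")


def term_to_gfm_table(term_name, properties):
    """Render (property, value) pairs as a two-column gfm table.

    gfm requires a header row, so the term name is placed there
    and the second header cell is left empty.
    """
    lines = [f"| {escape_cell(term_name)} | |", "| --- | --- |"]
    for prop, value in properties:
        lines.append(f"| {escape_cell(prop)} | {escape_cell(value)} |")
    return "\n".join(lines)


# Illustrative values only; the real term metadata has more rows.
print(term_to_gfm_table(
    "dcterms:identifier",
    [("Label", "Identifier"),
     ("Normative URI", "http://purl.org/dc/terms/identifier")],
))
```

The pipe escaping matters because definitions and notes in the term list are free text and could contain a `|`, which would otherwise split the cell.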

I think that we really should make a fundamental decision about how complicated this kind of page needs to be. From the standpoint of maintenance, the simpler the better, so formatting it in a way that can be rendered simply in Markdown is probably better than a format that requires manually maintained HTML. The other possibility that I think we should seriously consider is to abandon a manually maintained page for the term list and go with one that's generated on the fly from a metadata table. That's where the DwC team is headed. The CSV-based system that I've been working on for the obsolete vocabularies could also work, although it's fairly vanilla. At the moment, you can see an example at http://vuswwg-private.jelastic.servint.net/gom/dwc/dwctype.htm, although it may disappear at any time in the future. It's being generated from the table at https://github.com/tdwg/rs.tdwg.org/blob/master/dwctype/dwctype.csv.
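As a rough idea of what "generated from a metadata table" could look like, here is a small sketch that reads CSV text and emits one Markdown section per term. The column names (`term_localName`, `label`, `definition`) and the sample row are placeholders I made up for illustration; they are not taken from the actual dwctype.csv.

```python
import csv
import io

# Placeholder data with hypothetical column names, not the real table.
SAMPLE_CSV = """term_localName,label,definition
exampleTerm,Example Term,A placeholder definition.
otherTerm,Other Term,Another placeholder definition.
"""


def read_terms(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def render_term(row):
    """Render one metadata row as a small Markdown section."""
    return "\n".join([
        f"### {row['label']}",
        "",
        f"- Term name: {row['term_localName']}",
        f"- Definition: {row['definition']}",
    ])


def render_term_list(csv_text):
    """Render every row in the table, separated by blank lines."""
    return "\n\n".join(render_term(row) for row in read_terms(csv_text))


print(render_term_list(SAMPLE_CSV))
```

The appeal is that adding or changing a term means editing one row in the table and re-running the script, rather than touching the page markup.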

@ramorrismorris, do you have thoughts about this? I know that the original setup of the pages on the semantic mediawiki system was very labor-intensive. If AC doesn't change much, or quickly, manually maintained pages might be OK, but it's hard to predict the future. It seems to me that if TDWG has learned anything in the last 10 years, it's that simpler is usually better.

ramorrismorris commented 6 years ago

Much as I'm charmed by the wiki form of the ac terms page, and hard as this may be to do, it is probably time to make the wiki page a product, not a definition. The vicious issue is that deep down in any media wiki, changes can occur that have nothing to do with the wiki ac terms page and are not under its control. At first reading, I am unsure whether Markdown would be part of a solution...

Bob

(Sent by email in reply to Steve Baskauf's comment of Sun, Jun 3, 2018, 6:51 PM.)

-- Robert A. Morris

Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390

Filtered Push Project Kurator Project Harvard University

email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://wiki.filteredpush.org http://wiki.datakurator.org http://taxonconceptexplorer.org/ http://www.cs.umb.edu/~ram

baskaufs commented 6 years ago

The more I think about it, the more convinced I am that the actual terms page should be generated from the same data that serves as the source for the machine-readable metadata. Then there is only one place from which the various representations spring. It would be totally feasible to use software to generate Markdown (or HTML if that's better) for the terms table part of the document and merge it with static header and footer material. When I get some time, I think I'll try to generate Markdown from the data in the tables at https://github.com/tdwg/rs.tdwg.org, which are intended to be the source for the machine-readable representations. I still need to do some work on the borrowed terms first, though.
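The merge step I have in mind is simple. Here is a minimal sketch under the assumption that the header and footer are kept as static Markdown text and only the terms section is regenerated; the function names and the `label` and `definition` keys are hypothetical, not the actual column names in the rs.tdwg.org tables.

```python
def render_terms(rows):
    """Generate the terms section from metadata rows (hypothetical keys)."""
    sections = []
    for row in rows:
        sections.append(f"### {row['label']}\n\n{row['definition']}")
    return "\n\n".join(sections)


def assemble_page(header_md, terms_md, footer_md):
    """Join static header, generated terms, and static footer.

    Each part is trimmed so exactly one blank line separates them,
    and empty parts are skipped.
    """
    parts = [part.strip("\n") for part in (header_md, terms_md, footer_md)]
    return "\n\n".join(part for part in parts if part) + "\n"


# Placeholder content for illustration only.
rows = [{"label": "Example Term", "definition": "A placeholder definition."}]
page = assemble_page("# Term List\n", render_terms(rows), "Footer text.\n")
print(page)
```

Because the page is rebuilt from scratch each time, nobody edits the generated file directly; changes go either into the static header and footer or into the source tables.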

baskaufs commented 6 years ago

The script to generate the terms page from the metadata in the rs.tdwg.org repo is at https://github.com/tdwg/ac/tree/documentation-conversion/code. The other three pages (the introduction and structure pages from terms.tdwg.org and the Word format guide) have all been converted via Pandoc and then manually cleaned up as Markdown.

There are still some questions about final cleanup of the documents, but the actual conversion is done, so I'm going to close the issue.

Converting prior AC documents to Markdown #105