samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Making it machine-translatable will make hts-specs available to more people on the planet. #589

Open kojix2 opened 3 years ago

kojix2 commented 3 years ago

Hello.

 I would like to raise an issue from a slightly different perspective here. To be frank, I'm not very good at English. This text is written by DeepL, but on days when DeepL is off, Google Translate does it for me. Without machine translation, my life would not be possible.

 The same goes for reading papers. My intelligence is not capable of reading English papers quickly. I always look at the web page and then use Google Translate.

 And hts-spec .... Oops, hts-spec is not machine-translatable; the PDF has annoying line breaks and weird paragraphs that are not easily machine-translatable.

 This is why reading hts-spec is so difficult. Not only is the content difficult, but it is also difficult to use machine translation. Most people involved in bioinformatics are very smart, so this may not be a problem. Some people can even speak several languages easily. However, most people on the planet are not that smart. I am one of those not-so-smart people.

 I am convinced that providing the hts-spec in a form that can be read by machine translation will help more people. For example, it is an html web page with no line breaks. The hts-specs documentation seems to be generated from tex, but I don't know if it is easy to do so.

 I know this comment may be too candid and somewhat impolite. However, it contains what is true for me. Thank you for reading.

Translated with www.DeepL.com/Translator (free version)

jkbonfield commented 3 years ago

I know you closed this issue, but inclusivity is still an issue we value. There are tools like latex2html which may do a better job of making something machine translatable. There is also the TeX source, although you'll have to suffer a bit of markup and it may break translation. Or if you've found a better solution yourself, it may be good to note it here so others can find it and use the tips (or perhaps we can add it somewhere else).

You also didn't say which specifications are problematic. Is it all the TeX ones (i.e. PDF docs), or others?
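As a stopgap for the line-break problem described above, text copied out of a PDF can be re-joined into one line per paragraph before it is pasted into a translator. The following is only a minimal sketch: the function name `unwrap_paragraphs` is made up for illustration, it assumes blank lines separate paragraphs, and it does not try to repair words hyphenated across a line break or to recover tables.

```python
import re

def unwrap_paragraphs(text):
    """Join hard-wrapped lines into a single line per paragraph.

    Assumption: paragraphs are separated by at least one blank line.
    Words hyphenated across a line break are NOT repaired here.
    """
    # Split on blank lines (possibly containing stray spaces).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace, including newlines, to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = "The header\nline is\noptional.\n\nEach line\nhas fields."
    print(unwrap_paragraphs(sample))
```

This only helps with running prose; formulae and tables in the PDFs would still need a proper HTML conversion.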

tskir commented 3 years ago

I'm going to reopen this for now. This definitely sounds like a topic which should at least be discussed.

jkbonfield commented 3 years ago

Having html as the primary output may be problematic, at least initially, due to some cosmetic issues. However, there is perhaps something to be said for having alternative formats available, even if we just list them as a more accessible version with the master version explicitly being PDF.

I tried latex2html -split +0 -info "" -no_navigation on SAMv1 and it produced something, but left quite a lot of markup in there that looked poor. htlatex needed some hand-holding and ignoring of errors, but what it produced was then much superior. Potentially room for improvement to get it working better (albeit with missing bits due to ignoring the errors).

VCF fared better with htlatex. An example:

PDF: image

HTML: image

jkbonfield commented 3 years ago

Looking at the above again I see the Sigma combinatorial is incorrectly formatted by htlatex. There may be options to get such things improved, even if it's just getting it to insert formulae as images, but it's obviously not something we can rely on without having to proofread at the moment.

I think this is probably going to be more of a slow back-burner project than something we embrace quickly, unless anyone has spare time to work on it.

claymcleod commented 1 year ago

I wonder if Markdown may be a better choice for this material: at this point, Markdown is relatively ubiquitous and can be easily translated into a variety of mediums (including latex) with pandoc. You also have the benefit of Github's work making markdown accessible within their web platform where your users are. If you like the formatting of the current specs, you can probably pretty easily get it working with your own custom template (here's a link where they show you mostly how to do so).

jmarshall commented 1 year ago

The maintainers of these documents are familiar with Markdown.

claymcleod commented 1 year ago

> The maintainers of these documents are familiar with Markdown.

Yeah, sorry, the point here was not to say "here's this new technology, markdown" (😄). I actually have recently switched to using Markdown for more and more of the technical design documents that I used to use LaTeX for, and it's been a pleasant experience thus far. It was not so long ago that I wouldn't have considered Markdown an appropriate medium for the SAM specification, but perhaps now the ecosystem could support the full spectrum of required features (cross-compilation into custom LaTeX documents, figure generation, linters, etc).

jkbonfield commented 1 year ago

Some specs here are already in Markdown - see https://github.com/samtools/hts-specs/blob/master/htsget.md and https://github.com/samtools/hts-specs/blob/master/refget.md.

In my opinion they're not as well formatted as the LaTeX ones (due to limitations of md), but that's not really the point of this topic, as it was about accessibility. Being able to target multiple output formats to provide easier to process versions for screen-readers and language translation engines is obviously helpful. That doesn't mean markdown should be the primary document though - it could just as easily be an output from e.g. docbook, asciidoc, or even just using pandoc for latex to md.

However, it's mainly an issue of time to evaluate the alternatives and to validate that the translations don't introduce glitches (as demonstrated by htlatex above). (FWIW the CRAM spec started life as a Word doc. I extracted the XML from docx and used xslt to transform that to latex - scary! It mostly worked, but still needed quite a bit of editing to fix issues. Eww!)

claymcleod commented 1 year ago

> Being able to target multiple output formats to provide easier to process versions for screen-readers and language translation engines is obviously helpful. That doesn't mean markdown should be the primary document though - it could just as easily be an output from e.g. docbook, asciidoc, or even just using pandoc for latex to md.

Agree. I like the idea of using a latex to XYZ converter, but I do not know of any that would work fully with the content of the spec.

Just to see what was possible, I spent about two hours tonight messing around with pandoc to see if I could get something reasonable out of it. As expected, one major limitation is the complex formatting sections: mainly getting the tables to look correct. Beyond just converting it to HTML, I also tried using pandoc to turn the latex document into GitHub Flavored Markdown specifically (pandoc -f latex -t gfm SAMv1.tex > SAMv1.md) and render it on GitHub; that didn't work great either.
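For anyone who wants to repeat the experiment across several output formats, here is a small sketch that wraps the same pandoc invocation. The helper names `pandoc_command` and `convert` are invented for this example, and it assumes pandoc is installed and on the PATH; it says nothing about how good the converted output will be.

```python
import shutil
import subprocess

def pandoc_command(src, dst, source_format="latex", target_format="gfm"):
    """Build the pandoc argument list (hypothetical helper, not part of pandoc)."""
    # -f/-t select the input and output formats; -o names the output file.
    return ["pandoc", "-f", source_format, "-t", target_format, src, "-o", dst]

def convert(src, dst, target_format="gfm"):
    """Run pandoc, failing loudly if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, dst, target_format=target_format), check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try.
    for fmt, out in [("gfm", "SAMv1.md"), ("html", "SAMv1.html")]:
        print(" ".join(pandoc_command("SAMv1.tex", out, target_format=fmt)))
```

Calling `convert("SAMv1.tex", "SAMv1.md")` would be equivalent to the command quoted above, except that it writes via `-o` instead of shell redirection.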

Given this experience, I can think of two directions that would both (a) keep all the features needed and (b) improve the situation for accessibility:

jkbonfield commented 1 year ago

I too had a play with pandoc for github markdown and it was tragic, even with the latest release. The tikz bit was particularly special! The best I had previously for html was htlatex, but that wasn't perfect on formulae either. There may be options to improve it though, such as generating images for the formulae rather than attempting mathml conversion.

Fundamentally it's just a matter of free time. No one who works on these specs does it as a full time job, and we're down on maintainers already without taking on projects that only progress the presentation rather than content. If we could find a conversion tool that was pretty much flawless then maybe it'd be something we could automate.

jkbonfield commented 1 year ago

I also discovered https://math.nist.gov/~BMiller/LaTeXML/ which sounds ideal, but it doesn't work straight out of the box on our files. I haven't had time to dig around and figure out what shenanigans we're doing that breaks it. Anyway, with appropriate command line options and management it may perhaps work given the pedigree and online demos.

daviesrob commented 1 year ago

Just adding a note here that one of the problems with auto-translated specs is that some text should not be translated - for example the word coordinate when used as the value for a SAM @HD SO: tag. It turns out that you can use the HTML5 attribute translate="no" to control this. Although it may be a good idea to use class="notranslate" as well to make sure.

I guess to do this properly, we should probably get better at marking up things like keywords in our latex source, so if we started producing HTML output as well as PDFs then these attributes could be applied automatically.
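To make that concrete, here is a rough sketch of how such markup could be turned into the attributes automatically. It assumes a hypothetical LaTeX macro named `\keyword{...}` that the specs do not currently define, and it handles only a plain-text fragment with no nested braces, so it is an illustration of the idea rather than a working converter.

```python
import html
import re

# Hypothetical macro: the hts-specs sources do not define \keyword today.
KEYWORD_MACRO = re.compile(r"\\keyword\{([^{}]*)\}")

NO_TRANSLATE_OPEN = '<span translate="no" class="notranslate">'

def keywords_to_html(fragment):
    """Escape a text fragment for HTML and wrap \\keyword{...} contents
    in a span that asks translation tools to leave the text alone."""
    parts = []
    pos = 0
    for match in KEYWORD_MACRO.finditer(fragment):
        # Ordinary prose before the keyword: escaped, left translatable.
        parts.append(html.escape(fragment[pos:match.start()]))
        # The keyword itself: escaped and marked as not translatable.
        parts.append(NO_TRANSLATE_OPEN + html.escape(match.group(1)) + "</span>")
        pos = match.end()
    parts.append(html.escape(fragment[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(keywords_to_html(r"Records sorted by \keyword{coordinate} order"))
```

Marking keywords at the source, rather than matching words in the output, means the ordinary English word "coordinate" in running prose would still be translated while the literal tag value would not.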

drkshitiz commented 9 months ago

This is a good point to discuss. We should make and publish a guidelines document so that the documentation for the tools developed in the future remains machine translatable. We should review and discuss all the available tools including markdown, latex and more and decide on inter-compatible methods to create machine-translatable documentation.