Allow markup and/or specifying the language in all human readable strings

manton / JSONFeed

The JSONFeed.org website

Creative Commons Zero v1.0 Universal


Allow markup and/or specifying the language in all human readable strings #62

Open frivoal opened 7 years ago

frivoal commented 7 years ago

For ids, urls, version… that's not needed, but in all places where natural languages are expected (author names, titles, descriptions...), it must be possible to at the very least specify which human language is used, and also to use markup.

Without a language tag, text-to-speech engines, from Alexa and Siri to accessibility tools for blind people, will be unable to reliably read things out loud properly.

Also, in some languages, text rendering looks wrong if the language is not specified, because of some ambiguities in Unicode.

Also, some languages require markup to be written properly, and would need to drop content if forced into plain text. The classical example being Chinese/Japanese/Korean with ruby markup: <ruby>豊臣<rt>とよとみ</rt>秀吉<rt>ひでよし</rt></ruby>

See also this article.

manton commented 7 years ago

Thanks for the pointer to that article. This is a challenge in JSON because you can't just add a language attribute like you might in XML. And allowing markup in all fields makes it more difficult for feed readers to implement, and opens up new questions of what markup should be allowed. (You probably wouldn't want <img> or <table> in title text, for example. Native apps may end up stripping markup since not all controls can render HTML, which would negate the advantage.)

You're right that this is a limitation. I'm just not sure how to solve it without introducing new problems. Perhaps specifying the default language for the whole document would be a first step?

frivoal commented 7 years ago

Perhaps specifying the default language for the whole document would be a first step?

Yes, that's certainly a start. If you can have a language for the whole document, and op

Suggestion for language only:

make a mandatory language field at the top level.
make an optional language field per item
allow overriding it for every textual field

{
    "version": "https://jsonfeed.org/version/1",
    "language": "en-US",
    "title": "My Example Feed",
    "home_page_url": "https://example.org/",
    "feed_url": "https://example.org/feed.json",
    "items": [
        {
            "id": "2",
            "language": "fr-FR",
            "title": "Le titre",
            "content_text": {
                "language": "ja-JP",
                "value": "二つ目のアイテムです。"
            },
            "url": "https://example.org/second-item"
        },
        {
            "id": "1",
            "content_html": "<p>Hello, world!</p>",
            "url": "https://example.org/initial-post"
        }
    ]
}

I suspect you're not going to like the third part, but having 1 and 2 would already be an improvement.
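To make the override order concrete, here is a minimal sketch of how a reader might resolve the language of one field under the three-part suggestion above. The function name `field_language` is made up for illustration, and the object form of `content_text` is only the shape proposed in this comment, not part of any published JSON Feed version.

```python
from typing import Any, Optional


def field_language(feed: dict, item: dict, field: str) -> Optional[str]:
    """Return the language tag that applies to one text field of an item."""
    value: Any = item.get(field)
    # Third part of the suggestion: the field is an object with its own language.
    if isinstance(value, dict) and "language" in value:
        return value["language"]
    # Second part: an optional language on the item.
    if "language" in item:
        return item["language"]
    # First part: the feed-level default (None if the feed omits it).
    return feed.get("language")


feed = {
    "language": "en-US",
    "items": [
        {
            "id": "2",
            "language": "fr-FR",
            "title": "Le titre",
            "content_text": {"language": "ja-JP", "value": "二つ目のアイテムです。"},
        },
        {"id": "1", "content_html": "<p>Hello, world!</p>"},
    ],
}

second, first = feed["items"]
print(field_language(feed, second, "content_text"))  # ja-JP
print(field_language(feed, second, "title"))         # fr-FR
print(field_language(feed, first, "content_html"))   # en-US
```

Dropping the third part only removes the first `if`; the item-then-feed fallback stays the same.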

As for allowing arbitrary markup (either always, or optionally a special field like language indicating the type), I think it would be fine. You can always render the HTML into something simpler, but you cannot come up with markup if you start with text.
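As a rough illustration of "rendering the HTML into something simpler", the sketch below flattens markup to plain text with Python's standard `html.parser`. The names `TextExtractor` and `html_to_text` are hypothetical, and a real reader would probably want smarter handling of block elements and ruby than simply concatenating text nodes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Lossy conversion: tags are dropped, character references are decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(html_to_text("Tom &amp; <i>Jerry</i>"))
print(html_to_text("<ruby>豊臣<rt>とよとみ</rt></ruby>"))
```

The second call shows the loss: the reading ends up glued to the base text, and nothing in the plain string says which part was the annotation.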

manton commented 7 years ago

I like specifying language at the top level and item level. Seems like that would cover the most common needs for this. Thanks!

hsivonen commented 7 years ago

make a mandatory language field at the top level.

make an optional language field per item

Makes sense. The author object should probably allow a language key, too. Should there be "dir": "rtl", too, on all current objects to declare the general text direction for the human-readable fields of the object as right-to-left?

allow overriding it for every textual field

Seems excessive when item content can be HTML, so if it's really necessary for item title and content to differ in language, the language metadata for content could go into content HTML.

And allowing markup in all fields makes it more difficult for feed readers to implement, and opens up new questions of what markup should be allowed.

Allowing markup in every human-readable string indeed seems like it would go against simplicity. For the two examples mentioned (the other via linked article), bidi and ruby, Unicode provides control characters that could be used in plainish text without opening the can of worms of full HTML in all strings (and instead creating the issue of HTML-based viewers requiring conversion from control characters to markup, which isn't prohibitively complex). (U+FFF9 to U+FFFB, inclusive, for Ruby.)
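For a sense of what that conversion could look like, here is a sketch that turns the interlinear annotation characters (U+FFF9 anchor, U+FFFA separator, U+FFFB terminator) into `<ruby>` markup while escaping the rest of the string for HTML. The function name `annotations_to_ruby` is made up, no spec defines this algorithm, and malformed or nested sequences are simply passed through with their control characters intact.

```python
import html
import re

# anchor + base text + separator + annotation text + terminator
ANNOTATION = re.compile(
    "\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA([^\uFFF9-\uFFFB]*)\uFFFB"
)


def annotations_to_ruby(text: str) -> str:
    """Escape plain text for HTML and rewrite annotation runs as <ruby>."""
    out = []
    pos = 0
    for match in ANNOTATION.finditer(text):
        # Text between annotations is ordinary plain text: escape it.
        out.append(html.escape(text[pos:match.start()], quote=False))
        base = html.escape(match.group(1), quote=False)
        reading = html.escape(match.group(2), quote=False)
        out.append(f"<ruby>{base}<rt>{reading}</rt></ruby>")
        pos = match.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


title = "\uFFF9豊臣\uFFFAとよとみ\uFFFB\uFFF9秀吉\uFFFAひでよし\uFFFB & co"
print(annotations_to_ruby(title))
```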

It would be interesting to have stats both from Atom and from text in general on how common ruby actually is in titles and author names in contexts like feeds. (Obviously, in theory, ruby is applicable to name pronunciation in particular, but it's a different matter if it would be used in a by-line context.)

frivoal commented 7 years ago

Should there be "dir": "rtl", too, on all current objects to declare the general text direction for the human-readable fields of the object as right-to-left?

I suspect that would be good, yes.

so if it's really necessary for item title and content to differ in language

I'm not saying it happens a lot, but it does happen. Here's one, and here's one more

As you said, marking up author names in different languages is probably much more frequent though.

Without looking too far, your own blog contains articles in Finnish with summaries in English which wouldn't be out of place in the item.summary field. Presumably, you'd still want to tag the item as being in Finnish though, since its title and contents are. For a good bit of fun, I suggest taking such an English summary, pasting it into Google Translate, telling Google Translate that it is Finnish, and then asking it to read it aloud. I suppose that's similar to what a feed reader built into Amazon Alexa (or a feed reader for blind people) would give you as a description of a feed item you could ask it to read to you, and that's not particularly nice.

Since such things can happen, ideally it would be addressed through a generic mechanism. Parsing the HTML in the item content to find information about some strings in the host JSON seems way more awkward. How would you do it? String matching? JSON-LD in HTML in JSON? RDFa? That all seems terrible. Or maybe you're suggesting that they would not be linked, and that you would get language information when reading the article but could not map it back to the other fields in the JSON file. That's quite bad from the point of view of a text-to-speech engine going over the article list in your feed reader, for instance.

As for markup in things that are currently plain text only, this isn't just for internationalization. Again, without looking too far, the first entry in the Atom feed of your blog uses markup in the summary (<a>, <i>, <code>…). So it really isn't hard to find cases where you'd want that.

(U+FFF9 to U+FFFB, inclusive, for Ruby.)

Really? Yes, this is a thing that exists, but do you expect anything out there to actually support this? Browsers don't, for a start. Neither does Apple TextEdit, Apple Pages, or LibreOffice. Nor does vi or the OS X terminal, even though Unicode says this is for use in terminals.

Passing the content on to an html parser/UA is unrealistic, but expecting all feed readers out there to implement specialized ruby handling is? Come on. If you think supporting ruby is not relevant, please say so. Don't push this kind of straw-man.

hsivonen commented 7 years ago

I'm not saying it happens a lot, but it does happen. Here's one, and here's one more

Those are good examples of cases of the title differing in language. (They also happen to be cases where the site cares about language metadata in general but still hasn't gone through the trouble of actually providing title-specific language metadata.)

Without looking too far, your own blog contains articles in Finnish with summaries in English which wouldn't be out of place in the item.summary field.

Good point. I was thinking of title vs. content and forgot about summary.

Note, however, that on my feed, the summary contains both Finnish and English, so if there was a need to flatten it to plain text, there wouldn't be a single language tag appropriate for the whole string.

Parsing the HTML in the item content to find information about some strings in the host JSON seems way more awkward.

I didn't expect the use case to be finding out the language of an HTML blob on the JSON layer but just to have the language metadata available for font / voice selection and such when presenting the HTML using some mechanism that parses the HTML (with a proper HTML parser) anyway.

So if the item had a French title but English content, I meant claiming the item language as French and wrapping the content in <div lang=en>. (Which is logically sufficient for rendering but is wrong if the use case is some sort of data wrangling task like counting English-language posts.)
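A small sketch of that wrapping step on the producer side, assuming the producer knows both tags. The helper name `wrap_content_language` is hypothetical, and the comparison relies only on language tags being case-insensitive.

```python
import html


def wrap_content_language(content_html: str, content_lang: str, item_lang: str) -> str:
    """Wrap content in <div lang=...> only when it differs from the item language."""
    if content_lang.lower() == item_lang.lower():
        return content_html
    lang_attr = html.escape(content_lang, quote=True)
    return f'<div lang="{lang_attr}">{content_html}</div>'


item = {
    "id": "2",
    "language": "fr",  # claimed for the item because the title is French
    "title": "Le titre",
    "content_html": wrap_content_language("<p>Hello, world!</p>", "en", "fr"),
}
print(item["content_html"])
```

As the comment notes, this is enough for rendering, but a consumer counting posts by language from the JSON alone would see this item as French.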

Yes, this is a thing that exists, but do you expect anything out there to actually support this? Browsers don't, for a start.

As I said, HTML-based rendering would have to involve a conversion to markup. The conversion would not be difficult, but I concede that even if the spec gave a simple algorithm, feed readers might still not bother to implement it.

Neither does Apple TextEdit, Apple Pages, or LibreOffice. Nor does vi or the OS X terminal, even though Unicode says this is for use in terminals.

If a feed reader wanted to render titles using a non-HTML-based widget, then having HTML markup in the title doesn't help, either.

Passing the content on to an html parser/UA is unrealistic, but expecting all feed readers out there to implement specialized ruby handling is?

I didn't suggest that passing content to an HTML parser is unrealistic. Making all strings carry HTML may still be undesirable.

If you think supporting ruby is not relevant, please say so.

As the last paragraph of my comment implied, I suspect that that's the case for titles and author names, but I don't have data to back it up.

Don't push this kind of straw-man.

Not a straw man, but an observation of what's representable in plainish text. I gather the bidi-related controls are more commonly supported than the ruby ones, so even if using the ruby ones was rejected as impractical, for the bidi concern it might work to have JSON-level main text direction and then let e.g. titles be plain text with control characters for the inline bidi concerns instead of going all the way to HTML.
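One way to picture that split is sketched below: a JSON-level direction for the object, plus Unicode isolate controls (U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE) around an inline fragment in a plain-text title. The `"dir"` key is only the idea floated earlier in this thread, the helper names are made up, and `base_direction` is a simplified first-strong heuristic rather than the full Unicode bidi algorithm.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def base_direction(text: str, default: str = "ltr") -> str:
    """Guess "ltr" or "rtl" from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


def isolate(fragment: str) -> str:
    """Wrap a fragment so its direction does not leak into the surrounding text."""
    return f"{FSI}{fragment}{PDI}"


title = "Review of " + isolate("שלום עולם") + " (2017)"
item = {"id": "3", "dir": base_direction(title), "title": title}
print(item["dir"])
```

The title stays an ordinary JSON string with no escaping rules beyond JSON's own; a renderer that understands isolates gets the inline direction right, and one that does not still shows all the characters.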

Going to HTML means that common characters like < and & need escaping in what feed emitter authors will easily mentally model as plain strings (titles, author names) for some more or less naive value of "plain". Using less common code points for less common stuff has the benefit of not tripping everyone up with < and & but the downside of rare case handling more likely getting left unimplemented (even if the spec provided an algorithm) if the lack of handling isn't obvious in the common case.
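The escaping cost can be shown in a few lines. This sketch uses Python's standard `html.escape` and `html.unescape`; the sample title is invented.

```python
import html

plain_title = "R&D notes: when a < b"

# If titles were defined as HTML, every producer would have to emit this form,
# even producers that never use any markup in titles.
as_html = html.escape(plain_title, quote=False)
print(as_html)

# A consumer that wants the plain string back has to undo the escaping.
print(html.unescape(as_html) == plain_title)
```

A producer that forgets the first step ships a title that is plain text in its author's mind but HTML according to the format, which is the mismatch described above.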

hsivonen commented 7 years ago

expecting all feed readers out there to implement specialized ruby handling is? Come on.

It sounds like you are arguing that ruby is simultaneously important enough to support but not important enough that feed readers would have ruby-specific code. I don't believe the ruby feature can be had for free like that. Specifically, when developers (rightly or wrongly) may mentally model titles and author names as "plain" strings and mentally model JSON strings as presumptively "plain", making everyone escape their < and & isn't "for free".

frivoal commented 7 years ago

It sounds like you are arguing that ruby is simultaneously important enough to support but not important enough that feed readers would have ruby-specific code.

Yes. Because this is meant to be part of the World Wide Web, it ought to work for everyone. At the same time, I cannot expect individual application authors to all be knowledgeable about all languages in the world, and their particular quirks and needs. On the other hand, reusing as-is parts of the platform (HTML) which have been designed to handle it all is something you can realistically expect reading software to do. Yes, doing so is harder than not. But no, I do not think that HTML support is something that software authors would accidentally dismiss as niche or fail to notice.

I do agree that pretty often, strings will be enough, but I think that occasions to use markup if available will not be rare. Therefore, I think the format should give authors a choice of plain text (for when they have simple needs and don't want to worry about escaping and whatnot) or markup (to cover for all fancy cases).

If a feed reader wanted to render titles using a non-HTML-based widget, then having HTML markup in the title doesn't help, either.

Yes, but while HTML -> plaintext is a lossy operation, you start with all the information. The opposite direction starts with an information deficit.

Not a straw man, but an observation of what's representable in plainish text.

I agree it can be represented with plainish text, but I don't think it would be useful to do that. My goal is not to identify one by one the things people might need, figure out new approaches to solve each individually, and hope implementors will care. I'd much rather take on a dependency on a layer that is well known for having solved these problems.

hsivonen commented 7 years ago

Yes. Because this is meant to be part of the World Wide Web, it ought to work for everyone.

I think it's quite an exaggeration to suggest that not supporting ruby in titles and author names makes the format not work world-wide.

I think failing to support bidi would be a defect that would make the format not work world-wide, since it would exclude writing systems outright.

Ruby, however, is a relatively rare typographic device even in the context of the writing systems with which it is used, rather than an always-present feature of those writing systems. At the risk of starting a debate on whether it's appropriate to compare typographic devices across writing systems, I note that e.g. bold and italic text are, as typographic devices, more common in the writing systems that use them than ruby is in the writing systems that use it, but not supporting bold or italic text in titles is not a matter of making a format not world-wide applicable.

I agree it can be represented with plainish text, but I don't think it would be useful to do that.

The usefulness is that those control characters don't take away common characters that have other semantics (< and &), so if the control characters were used, feed producers who don't use the feature wouldn't need to pay for the feature in the form of escaping something. (Either way, feed readers would have to do something.)

frivoal commented 7 years ago

I think it's quite an exaggeration to suggest that not supporting ruby in titles and author names makes the format not work world-wide.

That is true. Ruby on its own is not a critical feature of any language. It is a thing that exists and is useful in some, but no ruby support is not the end of the world.

Nor is using <abbr> with a title. You can cope with its loss. It's also OK when you have a piece of text that contains multiple languages if you don't mark them up properly. It's also not that bad that you can't use <sup> or <sub>. It will be inconvenient in some cases, but people will cope.

However, it's not like we have to solve these problems from scratch. They are already solved: HTML is a good format to represent human language. As it is proposed now, JSONFeed is not only worse than the web at coping with the variety of human language and all its subtleties, it is worse than RSS and Atom are at it.

Especially given that JSONFeed has a decent chance of catching on thanks to JSON's simplicity over XML, that seems like a bad move to me.

Concretely, I think that JSONFeed readers will have to handle HTML anyway. Since the cost is eaten anyway, having a generic way of using either plain text when it's enough or HTML when plain text falls short, for every human-language text field, seems better. Sure, it is less often useful in titles than in summaries or names, and less often useful there than in the content. But even if less useful, it's not not useful, and you already have to have code to process HTML anyway, so you might as well do authors a favor.

manton commented 7 years ago

This is all good feedback. It seems most people agree that language should be added, but there are tradeoffs with allowing HTML in titles. You probably don't want to just put HTML in the existing title field, since there will be many feed readers that won't expect it there. So then you're going to need title_text and title_html to match the content_ fields. Just not sure about that kind of change for something that won't be needed very often.
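To illustrate what a paired `title_text` / `title_html` would ask of readers, here is a sketch of the selection logic. Both field names are only the hypothetical ones floated in this comment, `pick_title` is a made-up helper, and the fallback to the existing `title` field is an assumption about how older feeds would be handled.

```python
from typing import Tuple


def pick_title(item: dict, html_capable: bool) -> Tuple[str, bool]:
    """Return (title, is_html), preferring the form the widget can show."""
    text = item.get("title_text", item.get("title"))
    markup = item.get("title_html")
    if html_capable and markup is not None:
        return markup, True
    if text is not None:
        return text, False
    if markup is not None:
        # Only markup is available: the caller has to render or strip it.
        return markup, True
    return "", False


item = {"id": "4", "title_text": "H2O", "title_html": "H<sub>2</sub>O"}
print(pick_title(item, html_capable=True))
print(pick_title(item, html_capable=False))
print(pick_title({"id": "5", "title": "Plain"}, html_capable=True))
```

The last branch is where the tradeoff bites: a native widget that receives only `title_html` still needs some HTML handling.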

dissolve commented 7 years ago

We have had to go through this all before in the Social Web WG (https://github.com/w3c/Micropub/issues/37). For Micropub the Unicode bidi control characters were enough, so dir or anything similar was not necessary.

For specifying language, the simplest way is to allow a language tag that can appear anywhere. JF2 does this as well, so you can specify a language tag for an author, for an entire post, or for the entire document. A lower level overrides a higher level.
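The "language tag anywhere, lower level overrides higher level" rule amounts to a simple tree walk. The sketch below is not taken from JF2 or any spec: the generator name `strings_with_language` and the sample document are invented, and it reports every string it finds, including ids and URLs that a real reader would skip.

```python
from typing import Any, Iterator, Optional, Tuple


def strings_with_language(
    node: Any, inherited: Optional[str] = None, path: str = ""
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (path, string, language) for every string in a JSON-like tree."""
    if isinstance(node, dict):
        own = node.get("language")
        # A language key on this object overrides whatever was inherited.
        lang = own if isinstance(own, str) else inherited
        for key, value in node.items():
            if key != "language":
                yield from strings_with_language(value, lang, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from strings_with_language(value, inherited, f"{path}/{index}")
    elif isinstance(node, str):
        yield path, node, inherited


doc = {
    "language": "en",
    "title": "My Example Feed",
    "author": {"language": "fi", "name": "Example Author"},
    "items": [
        {"language": "ja", "title": "二つ目のアイテムです。"},
        {"title": "Hello, world!"},
    ],
}

for path, text, lang in strings_with_language(doc):
    print(path, lang)
```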

For multiple languages of the same text... Is it needed? From what I have seen, no. This can be done in HTML within the content_html body. It is done in practice on the web and on Facebook by just listing the article content twice, once in each language. If posting multiple language versions of the same content, I would sort of expect them to be multiple posts, one on /en/ and one on /jp/ for example, and thus likely separate feeds. This is how it is also usually done on the web. We went back and forth on this in work on AS2 (https://www.w3.org/wiki/Socialwg/2015-06-16-minutes#AS2_language_support). JSON-LD lets you do things like specifying any number of languages for fields, but it gets super messy.

dret commented 7 years ago

just as a design option (not one i particularly like, but here you go): https://tools.ietf.org/html/rfc5987 has a magic naming convention for i18n: use the field name with a trailing asterisk. then after this, you'd probably have an array of lang/content pairs. it works, technically. but:

most implementations will probably simply ignore this hard-to-parse construct
there's no indication of the default language