opengeospatial / ogcapi-records

An open standard for the discovery of geospatial resources on the Web.
https://ogcapi.ogc.org/records

Clarification on language(s) #195

Closed m-mohr closed 1 year ago

m-mohr commented 1 year ago

As far as I can see, hreflang is meant to follow RFC 5646 (Language-Tag). For the language property the format seems undefined. I'd propose to clarify that it uses the same format as hreflang.

Additionally, I'm wondering whether it would be helpful to define a list of available/supported languages, e.g. as a property languages, which is an array of languages.

Also, how should alternative representations in other languages be communicated in (static) catalogs? Maybe multiple self links with different hreflangs?

I'm asking because I'm writing this up for STAC and would like to align as much as possible. See also https://github.com/stac-extensions/language and https://github.com/stac-api-extensions/language

pvretano commented 1 year ago

12-DEC-2022: Discussed in the SWG. @pvretano will implement the following:

  1. Change "language" to "languages" which will be an array of available languages.
  2. The language representation will be as per RFC-5646.
  3. In a static record, you would point to the other "language" representations of the record using "alternate" links with the appropriate "hreflang".
  4. In an API you would just use standard language negotiation (i.e. Accept-Language).

m-mohr commented 1 year ago

This was a very quick turnaround, thanks.

I'm confused on point 1: Why replace language with languages? I think both should exist:

  1. language specifies the actual language you've received.
  2. languages specifies which languages are available (this is more or less a shortcut for checking language + all alternate links for the set of available hreflang values)

I agree with all other points and will align to use alternate instead of self.
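
The "shortcut" relationship in point 2 can be sketched as follows. This is a minimal Python sketch, not part of either specification: the function name `available_languages` and the sample record are hypothetical, and it assumes that `language` sits under `properties`.

```python
def available_languages(record: dict) -> list[str]:
    """Collect the record's own language plus every hreflang found on alternate links."""
    found = []
    current = record.get("properties", {}).get("language")
    if current:
        found.append(current)
    for link in record.get("links", []):
        tag = link.get("hreflang")
        # Only alternate links count; self, parent and root links are skipped.
        if link.get("rel") == "alternate" and tag and tag not in found:
            found.append(tag)
    return found


# Hypothetical record fragment used only for illustration.
record = {
    "properties": {"language": "en"},
    "links": [
        {"href": "item.json", "rel": "self", "hreflang": "en"},
        {"href": "de/item.json", "rel": "alternate", "hreflang": "de"},
    ],
}
print(available_languages(record))  # ['en', 'de']
```

A `languages` property would let a client skip this walk through the links entirely.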

m-mohr commented 1 year ago

Example from STAC:

{
  "stac_version": "1.0.0",
  "stac_extensions": [
    "https://stac-extensions.github.io/language/v1.0.0/schema.json"
  ],
  "type": "Feature",
  "id": "item",
  "bbox": [...],
  "geometry": {
    "type": "Polygon",
    "coordinates": [...]
  },
  "properties": {
    "datetime": "2020-12-11T22:38:32Z",
    "example": "An example product",
    "languages": [
      "de",
      "en"
    ],
    "language": "en"
  },
  "links": [
    {
      "href": "https://raw.githubusercontent.com/stac-extensions/language/main/examples/item.json",
      "rel": "self",
      "hreflang": "en"
    },
    {
      "href": "https://raw.githubusercontent.com/stac-extensions/language/main/examples/de/item.json",
      "rel": "alternate",
      "hreflang": "de"
    },
    {
      "href": "catalog.json",
      "rel": "parent",
      "title": "Example STAC Catalog",
      "hreflang": "en"
    },
    {
      "href": "catalog.json",
      "rel": "root",
      "title": "Example STAC Catalog",
      "hreflang": "en"
    }
  ],
  "assets": {
    "data": {
      "href": "https://cloud.example.com/examples/file.tif"
    },
    "metadata": {
      "href": "https://cloud.example.com/examples/metadata.xml",
      "type": "application/xml",
      "hreflang": "en"
    },
    "metadata_de": {
      "href": "https://cloud.example.com/examples/metatdata_DE.xml",
      "type": "application/xml",
      "hreflang": "de"
    }
  }
}

cnreediii commented 1 year ago

Just an FYI: In a CDB 2.0 datastore, there is a mandatory element 'language' (aka dct:language, PT_Locale) whose content is based on BCP 47 (RFC 5646). From the language perspective, OGC API - Records, the STAC API and CDB 2.0 are consistent.

ycespb commented 1 year ago

FYI: In the Testbed-18 ER Secure and Async Catalog (OGC 22-018) section 2.2.2, there is also the following note:

NOTE INSPIRE requires the Discovery Service to advertise the default language in the CSW GetCapabilities response. Proposing a similar mechanism to advertise the default language is further work. Possible approaches include:

m-mohr commented 1 year ago

@pvretano Can you confirm that https://github.com/opengeospatial/ogcapi-records/issues/195#issuecomment-1346718568 makes sense to you, too? I'd like to release this behavior into STAC soon and it would be really great to have this aligned between Records and STAC!

Here's the corresponding STAC extension: https://github.com/stac-extensions/language#fields-for-catalogs-collections-and-item-properties

pvretano commented 1 year ago

@m-mohr looking at it today. Will update this comment once I have reviewed it.

m-mohr commented 1 year ago

Thanks @pvretano. While you are at it, do you think it makes sense to allow more than just the language codes in languages?

So for example, instead of just "languages": ["de", "en-US", "gr"], we could also think about providing a bit more, which could be helpful for clients. For example:

"languages": [
  { "code": "de", "name": "German", "native": "Deutsch", "dir": "ltr" },
  { "code": "en-US", "name": "English (US)", "native": "English (US)", "dir": "ltr" },
  { "code": "gr", "name": "Greek", "native": "Ελληνικά", "dir": "ltr" }
]

Only code would be required.
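
To illustrate how a client might accept both the plain-code form and the richer object form, here is a minimal Python sketch. The helper name `normalize_languages` is hypothetical, and the field names (`code`, `name`, `dir`) are taken from the proposal above rather than from a published schema.

```python
def normalize_languages(entries: list) -> list[dict]:
    """Turn each entry into a language object; only "code" is mandatory."""
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            entry = {"code": entry}  # plain code form, e.g. "de"
        if not isinstance(entry, dict) or not entry.get("code"):
            raise ValueError(f"language entry without a code: {entry!r}")
        normalized.append(dict(entry))  # copy so the input is not mutated
    return normalized


# Hypothetical mixed input: one plain code and one language object.
mixed = ["de", {"code": "en-US", "name": "English (US)", "dir": "ltr"}]
print(normalize_languages(mixed))
```

Optional fields such as `name` or `dir` are passed through untouched, so a client can use them when present and fall back to the code otherwise.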

pvretano commented 1 year ago

@m-mohr my original comment was perhaps not as clear as it should have been because it did not distinguish clearly the language of the resource versus the language of the record.

The previous "language" tag was meant to encode the language of the resource that the record describes (if there was an associated language). So, changing it to an array allows a set of languages to be associated with the resource (e.g. the resource described by the record is available in English, German, Greek, etc.).

The language of the record itself (i.e. the language in which the record is presented to the client) is requested using the "Accept-Language" header when the record is retrieved. That language, however, is currently not explicitly encoded in the record with a specific tag. Rather a "rel=self" link can be included that includes an "hreflang" attribute to indicate the language of the retrieved record. Additional links with "rel=alternate" and "hreflang" attributes can point to additional language representations of the record.
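
The link arrangement described above could be generated along these lines. This is a minimal Python sketch under assumptions: the function name `language_links`, the base URL and the one-sub-path-per-language URL layout are all hypothetical.

```python
def language_links(base_url: str, record_id: str, current: str, others: list[str]) -> list[dict]:
    """Return a self link for the served language plus one alternate link per other language."""

    def href(lang: str) -> str:
        # Hypothetical URL layout: one sub-path per language.
        return f"{base_url}/{lang}/{record_id}.json"

    links = [{"rel": "self", "href": href(current), "hreflang": current}]
    links += [
        {"rel": "alternate", "href": href(lang), "hreflang": lang}
        for lang in others
        if lang != current  # the current language is already covered by the self link
    ]
    return links


for link in language_links("https://example.com/records", "rec-1", "en", ["de", "en"]):
    print(link)
```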

Does this all make sense?

I am mocking up an example record with language information which I will add to the issue later today.
I like the encoding of "languages" that you present above so I will use that.

If you think there would be value in explicitly encoding the language of the record in the record itself then I would not be opposed to reintroducing the "language" tag for that purpose ...

m-mohr commented 1 year ago

Thank you, @pvretano. This clarifies what the difference between STAC and Records is currently.

First and foremost, it is 100% clear and aligned between STAC and Records that in an API context content negotiation is used to request specific languages and report the language of a response. We are also aligned with regard to the hreflang property. Unfortunately, there are also static catalogs, both in STAC and Records. Here content negotiation is often not available, so we need an alternative. Also, it is often useful to replicate important headers (e.g. the content language) in the body, because if you store a response in a (local) file, you lose the (language) headers, but it could still be useful to have this information. Thus, my aim was to find a solution that works without headers for static catalogs and can also be useful in the context of APIs, I think.

For the language you may want to encode multiple things:

To encode the language of a resource we use the hreflang property in links and assets. Now the difference comes up:

In theory, you are right, we don't need these properties at all because it could all be handled through hreflang in links. self link + hreflang could describe the language of the metadata, alternate links + hreflang could describe other available languages, link to data file (resource) + hreflang could describe the language(s) of the resources.

This is pretty cumbersome though as you'd need to wade through links to figure this out. Also, in STAC self links are not required as catalogs can be portable and the location may not be known upfront. Also, I'm not overly happy with overloading "alternate" for alternative languages, alternative media types, alternative ... (but that's a different discussion). In the end, the language and languages properties are often just a "summary" and for convenience.

Still, I think it would be good to declare this directly without having to look through links with hreflangs.

Ultimately, we could also allow for a very verbose solution:

While "language" and "languages" could be aligned between Records and STAC, I'm not so sure about the "resourceLanguages". STAC doesn't need that in many cases and I wasn't able to come up with a good name that describes both cases (assetLanguages vs. resourceLanguages), so we may just have different properties here that don't conflict but share the same structure (as described above). An alternative could be recordLanguage, recordLanguages and languages, but then we'd be less aligned between STAC and Records because record doesn't fit into the STAC terminology. So I'd prefer the first variant, but I'm happy to discuss other ideas and alternatives.

What do you think? Would you be open to that?

pvretano commented 1 year ago

@m-mohr just to make sure I understand ...

Is this correct? If yes, then I think I am OK with that. If you verify that my understanding is correct, then I will present it to the SWG and report back in this issue. (NOTE: the next SWG meeting is on 23-JAN-2023 ... I hope that is not too late for you).

m-mohr commented 1 year ago

Thank you for taking the time, @pvretano. Yes, this is generally correct.

I have one concern, though, about the requirement in the second bullet. You are saying:

if there are alternate links in the record with hreflang attributes, the hreflang values must exist in this languages list

I see potential issues here, which I mentioned above, due to the overloading of the alternate relation type (alternate type vs. alternate language). Here's an example of some links that would not be unusual to see in STAC, and I could imagine that it also occurs in Records (although I think you require the type, right?):

Let's say the links are in a metadata document in Greek (i.e. contains "language": "gr")

{
  "href": "../de/item.json",
  "rel": "alternate",
  "hreflang": "de"
},
{
  "href": "../item.json",
  "rel": "alternate",
  "hreflang": "en"
},
{
  "href": "https://stacindex.org/browser/example/de/item.json?uiLanguage=de",
  "rel": "alternate",
  "type": "text/html",
  "hreflang": "de"
},
{
  "href": "https://stacindex.org/browser/example/item.json?uiLanguage=en",
  "rel": "alternate",
  "type": "text/html",
  "hreflang": "en"
},
{
  "href": "https://stacindex.org/browser/example/item.json?uiLanguage=fr",
  "rel": "alternate",
  "type": "text/html",
  "hreflang": "fr"
},
{
  "href": "https://stacindex.org/browser/example/gr/item.json?uiLanguage=gr",
  "rel": "alternate",
  "type": "text/html",
  "hreflang": "gr"
}

You see that there are more languages available in the UI than for the metadata. I'd expect that languages would be something like the following (i.e. not include French):

"languages": [
  { "code": "de", "name": "German", "native": "Deutsch" },
  { "code": "en", "name": "English", "native": "English" },
  { "code": "gr", "name": "Greek", "native": "Ελληνικά" }
]

So either we make the relationship between languages and the alternate type less demanding or we have to clearly specify the corresponding media types, but that would (at least in STAC) be JSON + GeoJSON (+ missing type as type is not required in STAC yet).
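
The second option (restricting by media type) could look roughly like this. This is a minimal, non-normative Python sketch: the function name `metadata_languages` is hypothetical, and the set of accepted media types (JSON, GeoJSON, or no `type` at all) follows the suggestion above for STAC.

```python
# Media types treated as "source metadata"; None covers links without a type.
METADATA_TYPES = {"application/json", "application/geo+json", None}


def metadata_languages(current: str, links: list[dict]) -> list[str]:
    """List the languages of the metadata itself, ignoring alternates in other media types."""
    codes = {current}
    for link in links:
        if link.get("rel") == "alternate" and link.get("type") in METADATA_TYPES:
            if link.get("hreflang"):
                codes.add(link["hreflang"])
    return sorted(codes)


# A subset of the links from the example above.
links = [
    {"href": "../de/item.json", "rel": "alternate", "hreflang": "de"},
    {"href": "../item.json", "rel": "alternate", "hreflang": "en"},
    {"href": "https://stacindex.org/browser/example/item.json?uiLanguage=fr",
     "rel": "alternate", "type": "text/html", "hreflang": "fr"},
]
print(metadata_languages("gr", links))  # ['de', 'en', 'gr']
```

With this filter the HTML-only French link does not contribute to the list, which matches the expected `languages` value above.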

Thank you for bringing it to the SWG. Jan 23 is fine for me. If it helps I could also join the meeting. I'll also prepare an update for the STAC extension that follows this proposal.

m-mohr commented 1 year ago

I just had another idea to "merge" resourceLanguages and languages into languages and just add boolean properties as follows:

"languages": [
  { "code": "de", "name": "German", "native": "Deutsch", "record": true, "resource": true },
  { "code": "en", "name": "English", "native": "English", "record": true, "resource": true },
  { "code": "gr", "name": "Greek", "native": "Ελληνικά", "record": true, "resource": false },
  { "code": "fr", "name": "French", "native": "Français", "record": false, "resource": true }
]

I'm not sure whether this is a good idea and whether this mixes separate concerns too much so looking for thoughts of others.

pvretano commented 1 year ago

@m-mohr my feeling is that it mixes separate concerns too much, but let's give others a chance to chime in with their thoughts ...

m-mohr commented 1 year ago

Yeah, happy with that, too.

An addition to https://github.com/opengeospatial/ogcapi-records/issues/195#issuecomment-1380306075: Should the languages list contain the current language itself? I'd say for clients it would be good, so it would not just be alternate, but alternate + self.

pvretano commented 1 year ago

@m-mohr yes I suppose the languages list should contain the current language as well, although that is slightly redundant. Perhaps we can get rid of the language tag and simply say the first item in the languages list is the language of the record in hand.

About this comment ... I hadn't considered that, but I would say that the list of languages should include all the available languages independent of their media type representation. If there is a type dependency, that can be represented in the alternate links via the type attribute and/or negotiated between the client and server using the normal HTTP content type and language negotiation handshake. Your thoughts?

m-mohr commented 1 year ago

@pvretano Interesting idea about putting the current language first. While I like having it all in one place, I don't like that it is not very explicit, and "the average user" may get confused about what the actual language is. It requires good knowledge of the spec. Alternatively, we could also remove the current language from languages and, instead of just providing a code for language, use the "language object" from above there as well. Phew... no strong preference right now.

Example:

"language": { "code": "gr", "name": "Greek", "native": "Ελληνικά" },
"languages": [
  { "code": "de", "name": "German", "native": "Deutsch" },
  { "code": "en", "name": "English", "native": "English" }
]

I'm not sure about adding e.g. the "UI languages" to the languages list. It feels a bit weird to me as it mixes separate concerns. For example, I'm currently making STAC Browser multilingual with, right now, 6+ planned languages, while the metadata only has 2 metadata languages. So the languages list would have 6 entries and that seems a bit excessive to have in the languages list...

(but of course I'm relatively biased right now towards the use case I'm working on)

m-mohr commented 1 year ago

I updated the STAC extension to reflect what you proposed here: https://github.com/stac-extensions/language

pvretano commented 1 year ago

@m-mohr I have no strong preference. However, if I had to pick I would say ... language for the current language, languages for the list of other available languages. So, the current language is NOT in the list of other languages.
I still think that the list of other languages should contain all the available other languages regardless of the representation. The HTML representation is as valid as any other and likely one of the more common representations ... no? I'll review the STAC extension write-up later today ...

m-mohr commented 1 year ago

The HTML representation is as valid as any other and likely one of the more common representations ... no?

No, not in my eyes. For me, languages is the list of languages in which the source metadata files are available. The STAC clients usually only work with the source metadata (JSON) variants and all others are just spit out or ignored. But I guess I could filter the languages somehow...

pvretano commented 1 year ago

@m-mohr I could be wrong about the HTML representation ... I'll present to the SWG and see what the others think.

pvretano commented 1 year ago

23-JAN-2023: In STAC, the asset language is represented using hreflang in the asset section, and there is a rule that basically says that if a STAC record is requested in a specific language AND the asset has associated languages, only the requested language is represented in the asset section. So, if the STAC item is requested in Greek and there is a "Greek" asset, only that link will be listed in the asset section. Of course, all this only applies to the API; static records would probably include the links to all the available languages.

pvretano commented 1 year ago

@m-mohr with regard to the language parameter in the STAC API language proposal, why is it only a single language? Can't its value be the same string as that used for the Accept-Language header, with the same semantics (e.g. `language=de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,fr;q=0.3`)?
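
For illustration, a parameter value with Accept-Language semantics could be resolved against the available languages roughly as follows. This is a minimal Python sketch: the function names are hypothetical, matching is a simplified prefix match rather than the full lookup algorithm of RFC 4647, and wildcards are not handled.

```python
def parse_language_ranges(value: str) -> list[tuple[str, float]]:
    """Parse 'de-DE,de;q=0.9,en;q=0.7' into (tag, weight) pairs, highest weight first."""
    ranges = []
    for part in value.split(","):
        tag, _, params = part.strip().partition(";")
        weight = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat an unparsable weight as "not acceptable"
        if tag.strip():
            ranges.append((tag.strip().lower(), weight))
    # sorted() is stable, so entries with equal weight keep their original order.
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def pick_language(value: str, available: list[str], default: str) -> str:
    """Return the first available language matching the preferences, else the default."""
    for tag, weight in parse_language_ranges(value):
        if weight <= 0:
            continue
        for candidate in available:
            lowered = candidate.lower()
            # Simplified match: exact, or one tag is a prefix of the other.
            if lowered == tag or lowered.startswith(tag + "-") or tag.startswith(lowered + "-"):
                return candidate
    return default


preferences = "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,fr;q=0.3"
print(pick_language(preferences, ["en", "fr"], default="en"))  # en
```

The same function could serve both the header and a query parameter, which keeps the two mechanisms consistent.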

m-mohr commented 1 year ago

@pvretano This was just meant as a very simple alternative for "tinkering" in "simpler" environments, e.g. in the Browser where it's not easily possible to send HTTP headers. So I kept it simple. Recently, I've actually thought about removing the parameter altogether and just relying on the header. What do you think? What's the general direction OGC APIs go for? I've often seen e.g. ?f=json in OGC API implementations as an alternative to Accept headers, which would somewhat align with the current specification of ?language=de, it seems.

pvretano commented 1 year ago

@m-mohr the usual thinking at OGC is to "recommend" that implementations have a mechanism to mint URLs that need to be embedded, or for situations where the client does not have easy access to HTTP headers. So, take f for example. That is not part of the specification per se. It is just an example for creating URLs where the output format can be specified. I guess it would be the same thing with a language parameter. It would not be "standard" but only a suggestion that implementations create a mechanism for requesting records in a specific language when access to the HTTP headers is not feasible. In all cases the HTTP way is the normative way.

m-mohr commented 1 year ago

@pvretano Then I'd suggest following the same pattern. As I can't find anything about f in the specs (features, records), I'd also remove it from the STAC API - Languages extension.

pvretano commented 1 year ago

@m-mohr here is the reference to f in Features ... https://docs.opengeospatial.org/is/17-069r4/17-069r4.html#encodings It's in the NOTE in that section ...

m-mohr commented 1 year ago

@pvretano Thanks, I did not find that (but "f" is also not an ideal search term ;-) ). So you'd add a similar wording for language or accept-language into Records? Then I'd just refer back to that in the STAC API extension.

pvretano commented 1 year ago

@m-mohr yes ... that is my plan.

pvretano commented 1 year ago

PR #211 created to align language handling as per the discussion in this issue.

m-mohr commented 1 year ago

@pvretano Added a comment in the PR, thanks.

pvretano commented 1 year ago

01-MAY-2023: Resolved by #211. Closing.