relaton / relaton-models

Bibliographic models

Make basicdoc the default text model for textual content in Relaton #52

Open opoudjis opened 6 months ago

opoudjis commented 6 months ago

From https://github.com/metanorma/bipm-si-brochure/issues/224

So we are actually stepping into uncharted (undefined) territory, because the "title" element's content model is currently undefined for rich text.

Are we going to do that now in the Relaton data model, and define the text model for textual content?

Officially, we're agnostic, and allow text models to be made explicit in places like titles, which allow xs:any:

FormattedString =
  # attribute format { ( "plain" | "html" | "docbook" | "tei" | "asciidoc" | "markdown" ) }?,
  attribute format { ( "text/plain" | "text/html" | "application/docbook+xml" |
    "application/tei+xml" | "text/x-asciidoc" | "text/markdown" | "application/x-metanorma+xml" | text ) }?,
  LocalizedStringOrXsAny

LocalizedStringOrXsAny1 =
  # multiple languages and scripts possible: comma delimit them if so
  attribute language { text }?,
  attribute locale { text }?,
  attribute script { text }?,
  ( text | AnyElement )+

LocalizedStringOrXsAny =
  LocalizedStringOrXsAny1 |
  element variant { LocalizedStringOrXsAny1 }+

AnyElement = element * { ( text | AnyElement)+ }

That's what's in the grammar, and what we were thinking 5 years ago.

De facto, we already have a text model for textual content, and we've been using it. Unsurprisingly, it's Metanorma itself, or rather its core, Basicdoc. So with IETF abstracts, we use <p>, not IETF's native <t>; and we replace LaTeX formatting in BibTeX-derived titles with Basicdoc markup. Basicdoc is of course pretty much HTML at the inline markup level, so it's a safe default.

The Relaton grammar is already using Basicdoc: it's why it recognises <image> for logos.

I suggest we make this official, and make Relaton text such as titles be either plain text or Basicdoc XML.

opoudjis commented 6 months ago

@andrew2net No real consequence for you: this just acknowledges in the grammar something we are already doing. The grammar for titles will be the grammar of a Metanorma paragraph's contents: a string with familiar optional tags like <em> and <sup>. We are ruling out absurd things like LaTeX-formatted titles.
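
To illustrate the intent, here is a minimal Python sketch of "a string with optional inline tags". The allow-list `ASSUMED_INLINE_TAGS` and the helper name are hypothetical and chosen only for illustration; the real set of permitted elements is whatever Basicdoc defines for paragraph content.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list for illustration only; the authoritative set of
# inline elements is the one Basicdoc defines for paragraph content.
ASSUMED_INLINE_TAGS = {"em", "strong", "sub", "sup", "tt"}


def title_uses_only_inline_markup(title_xml: str) -> bool:
    """Return True if every element nested in <title> is in the assumed allow-list."""
    title = ET.fromstring(title_xml)
    # iter() yields the root element first, so skip it and check descendants only.
    descendants = list(title.iter())[1:]
    return all(el.tag in ASSUMED_INLINE_TAGS for el in descendants)


plain = '<title language="en">The International System of Units</title>'
rich = '<title language="en">Acceleration in <em>m</em> s<sup>-2</sup></title>'
block = '<title language="en"><p>A paragraph is block-level markup</p></title>'

for sample in (plain, rich, block):
    print(title_uses_only_inline_markup(sample), sample)
```

Plain text and inline-only markup pass this check; anything outside the assumed allow-list, such as a block-level element, does not.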

andrew2net commented 6 months ago

@opoudjis do we need any update in the model to be able to use <sup> in the document IDs?

opoudjis commented 6 months ago

Yes, but this is about titles, not identifiers. We don't allow markup in identifiers yet...

opoudjis commented 6 months ago

Elsewhere, I see that markup in identifiers has been requested: https://github.com/metanorma/bipm-si-brochure/issues/224.

opoudjis commented 6 months ago

@andrew2net Grammar updated.

andrew2net commented 5 months ago

@opoudjis the title's content was FormattedString, which inherits from LocalizedString, which has a language attribute. So we don't have language-specific titles anymore. Is this correct?

opoudjis commented 5 months ago

It isn't: we do continue to have language-specific titles. But do LocalizedString and FormattedString represent language choice differently? If so, I'll put an explicit language attribute back in; I don't want to break dependent code with this change.

andrew2net commented 5 months ago

FormattedString uses LocalizedString as its content, so language representation is defined in LocalizedString. But the title doesn't use FormattedString anymore, so it no longer has a language representation.

andrew2net commented 5 months ago

@opoudjis FYI subdivision also doesn't have a language specification anymore, and neither does affiliation/description.

andrew2net commented 5 months ago

@opoudjis we have <p> elements in titles and abstracts. We transform them into <t> for NIST's BibXML format. With this update <p> is not allowed anymore. Shouldn't we add it to our grammar?

opoudjis commented 5 months ago

Crap, I was hoping that wouldn't turn up.

We need to for abstracts (not titles), yes. I'll update the grammar. There are several omissions here that you've pointed out...

opoudjis commented 5 months ago

I've put the language attributes back in, and I've made abstracts be either multiple text elements or multiple blocks (including paragraphs).

The one change that is not applicable concerns subdivision, which has been refactored: it is no longer the name of a subdivision of an organisation, but an entire organisation itself, potentially with multiple levels of recursion:

organization = element organization { OrganizationType }
OrganizationType =
    orgname+, subdivision*, abbreviation?, uri*, org-identifier*, contact*, logo?

So there is no prospect of there being a language attribute on subdivision, though there is of course a language attribute on organization/subdivision/name.
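
As a rough illustration of that recursion, here is a minimal Python sketch that walks nested subdivisions and reads the language from each name. The sample XML, the organisation names, and the helper `names_by_level` are invented for the example; it assumes the name element is serialised as `name` and that a subdivision has the same content as an organization, as described above.

```python
import xml.etree.ElementTree as ET

# Invented sample instance: a subdivision is assumed to have the same content
# as an organization, so it can itself contain further subdivisions.
SAMPLE = """
<organization>
  <name language="en">Example Standards Body</name>
  <subdivision>
    <name language="en">Technical Committee 1</name>
    <name language="fr">Commission technique 1</name>
    <subdivision>
      <name language="en">Working Group 2</name>
    </subdivision>
  </subdivision>
</organization>
"""


def names_by_level(org, depth=0):
    """Yield (depth, language, name) for an organization and its nested subdivisions."""
    # The language lives on each name, not on the subdivision element itself.
    for name in org.findall("name"):
        yield depth, name.get("language"), name.text
    for sub in org.findall("subdivision"):
        yield from names_by_level(sub, depth + 1)


rows = list(names_by_level(ET.fromstring(SAMPLE)))
for depth, language, text in rows:
    print("  " * depth + f"[{language}] {text}")
```

Consumers that previously read a language from subdivision itself would, under this reading, look at each nested name instead.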

Please confirm this is now working correctly.

opoudjis commented 1 week ago

The flattening of definitions in RNC is running counter to my need to reuse attribute groups in downstream grammars, and I am unwilling to compromise the documentation of my grammars for the incompetence of the implementers of Shale.

As part of reconciling the updates here with the updates because of https://github.com/metanorma/metanorma-model-iso/issues/80, I am reinstating the modularised attribute groups, such as LocalizedStringAttrs. @andrew2net You will need to tell me if lutaml-model still fails to deal with them, but frankly, if we went to all this rigmarole of reimplementing Shale from scratch, we need to make it deal with quite routine grammar modularity. I should not have to lobotomise my grammars because of poor implementation.

opoudjis commented 1 week ago

@andrew2net Updates I've made during refactoring of biblio:

  1. URIs can be specific to a language; in fact, this is being used already with TypedUri e.g. in BIPM, to differentiate French-language and English-language links. I'm extending that to untyped URI:
uri =
  element uri {
    ## The types of URI are open-ended, but include the IANA link relations specified in RFC 8288
    attribute type { text }?,
    LocalizedStringAttributes,
    ## URI content
    xsd:anyURI
  }
  2. Formatted address can contain <br/>;
formattedAddress = element formattedAddress { (text | br)+ }
  3. personidentifier/@type is no longer constrained to an enum:
## An identifier of a person according to an international identifier scheme
person-identifier =
  element identifier {
    ## The international identifier scheme for the identifier of the person.
    ## Examples include "isni", "orcid", "uri"
    attribute type { text },
    ## The identifier value
    text
  }
  4. Redefined classification to use the same type as docidentifier: it already substantially overlapped:
bclassification = element classification { DocIdentifierType }
  5. place optionally has a URI disambiguating it: this was included in the ISO 690 model, but never added to Relaton:
## Place associated with the production of a bibliographic item
bplace = element place {
    ## City
    bibliocity?,
    ## Region that city is located in, given for disambiguation purposes.
    biblioregion*,
    ## Country that city is located in, given for disambiguation purposes.
    bibliocountry*,
    ## Name of the place, not broken down semantically
    formattedPlace?,
    ## URI in a geographical registry identifying the place
    uri?
}
  6. size/value might be empty as a tag: the attribute may provide enough information (e.g. "whole" gets used in extent to denote the entire document)
sizevalue = element value {
  ## The type of size. Recommended values: page, volume, time (in ISO 8601 duration values),
  ## data (including unit), value (free-form string)
  attribute type { text },
  ## The quantity of the size
  text?
}
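
For the language-specific URIs in the first item, here is a minimal Python sketch of how a consumer might choose between links. It assumes the attribute group contributes a `language` attribute that may hold comma-delimited codes, as in the earlier LocalizedStringOrXsAny1 definition; the `bibitem` wrapper, the example.org URLs, and the helpers `languages_of` and `pick_uri` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample; the URLs and the type value are placeholders.
SAMPLE = """
<bibitem>
  <uri type="src" language="en">https://example.org/en/brochure</uri>
  <uri type="src" language="fr">https://example.org/fr/brochure</uri>
  <uri>https://example.org/brochure</uri>
</bibitem>
"""


def languages_of(uri):
    """Return the language codes on a uri element (assumed comma-delimited)."""
    raw = uri.get("language")
    return [] if raw is None else [code.strip() for code in raw.split(",")]


def pick_uri(bibitem, language):
    """Prefer a uri tagged with the requested language, else an untagged one."""
    uris = bibitem.findall("uri")
    for uri in uris:
        if language in languages_of(uri):
            return uri.text
    for uri in uris:
        if not languages_of(uri):
            return uri.text
    return None


item = ET.fromstring(SAMPLE)
print(pick_uri(item, "fr"))
print(pick_uri(item, "de"))
```

Falling back to the untagged URI is a design choice made for this sketch, not something the grammar prescribes.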