Support providing text field data in multiple languages

cjerdonek commented 9 years ago

For all text fields, the spec should have a way to allow providing that information in multiple languages. In San Francisco, for example, it is a requirement that all election information be provided in English, Spanish, and Chinese (as well as in Filipino starting January 1, 2016).

Currently, it seems like VIP consumers wouldn't be able to meet the same language requirements that jurisdictions may have (unless there is a way of providing additional languages that isn't documented in the web site documentation).

jungshadow commented 9 years ago

I'm adding this to the "Version 5.0" bucket, but there may be some pushback. Personally, I want this to happen, and soon. It will ultimately come down to how easy it is to incorporate translations in every VR/EMS system (assuming the translation data is even held in either).

I imagine that @pstenbjorn might have some insight on this.

cjerdonek commented 9 years ago

One of my action items from the meeting was to propose something for this issue.

Before coding it, I wanted to describe what I was going for. My goal was to make adding multi-language support to an element in the schema as simple and DRY as possible, for example by changing this--

<xs:element name="greeting" type="xs:string"/>

to this--

<xs:element name="greeting" type="multiLangText"/>

A concrete example would look like this--

<greeting>
    <text lang="en">Hello</text>
    <text lang="es">Hola</text>
    <text lang="fr">Bonjour</text>
</greeting>

Ideally, the following would also be acceptable (if only English were available and for backwards compatibility, etc)--

<greeting>Hello</greeting>

(It looks like <xs:union> would allow a type to be defined that is either an xs:string or a complex type.)
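To illustrate how a consumer might handle both forms, here is a minimal Python sketch. The helper name read_multilang and the choice of "en" as the default language for the plain-string form are assumptions made for illustration, not part of the spec.

```python
import xml.etree.ElementTree as ET


def read_multilang(element, default_lang="en"):
    """Return a {language: text} dict for either form of the element."""
    children = element.findall("text")
    if children:
        # Multi-language form: one <text lang="..."> child per language.
        return {child.get("lang"): (child.text or "") for child in children}
    # Plain-string form: treat the content as the default language.
    return {default_lang: (element.text or "").strip()}


multi = ET.fromstring(
    '<greeting><text lang="en">Hello</text><text lang="es">Hola</text></greeting>'
)
plain = ET.fromstring("<greeting>Hello</greeting>")

print(read_multilang(multi))
print(read_multilang(plain))
```

Either way the caller gets the same dictionary shape, so downstream code would not need to know which form the feed used.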

Does this seem good to people?

cjerdonek commented 9 years ago

And here is a rough stab at a schema definition for the idea described in the previous comment (I make no claims to have expertise in XML):

<!-- A string with a required language specifier. -->
<xs:complexType name="langString">
    <xs:simpleContent>
        <xs:extension base="xs:string">
            <!-- TODO: should the language values be restricted to certain values? -->
            <xs:attribute name="lang" type="xs:string" use="required"/>
        </xs:extension>
    </xs:simpleContent>
</xs:complexType>
<!-- A text value in one or more languages. -->
<xs:complexType name="multiLangText" mixed="true">
    <!-- xs:union can only combine simple types, so mixed content is used
         here instead: the element may hold a plain string (e.g. when only
         English is available) or one or more <text> children. Note that
         this does not forbid mixing the two forms. -->
    <xs:sequence>
        <xs:element name="text" type="langString" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
</xs:complexType>

pstenbjorn commented 9 years ago

@cjerdonek this is a good start. The W3C has defined an xsd type of xs:language - see documentation here.

Below is an example based on our conversation using the existing VIP schema. The xs:language type expects a valid RFC 3066 language tag, e.g. en-US.

<xs:element name="referendum">
  <xs:complexType>
    <xs:choice maxOccurs="unbounded">
      <xs:element name="title" type="xs:string" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="subtitle" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="brief" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="text" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="pro_statement" type="ballotLanguage" />
      <xs:element minOccurs="0" maxOccurs="unbounded" name="con_statement" type="ballotLanguage" />
      <xs:element minOccurs="0" name="passage_threshold" type="xs:string" />
      <xs:element minOccurs="0" name="effect_of_abstain" type="xs:string" />
      <xs:element name="ballot_response_id">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="sort_order" type="xs:integer" />
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:choice>
    <xs:attribute name="id" type="xs:string" use="required" />
  </xs:complexType>
</xs:element>

<xs:complexType name="ballotLanguage">
  <xs:all>
    <xs:element name="text" type="xs:string" />
    <xs:element name="lang" type="xs:language" />
  </xs:all>
</xs:complexType>
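
As a rough illustration of what xs:language accepts at the lexical level, here is a small Python sketch using the pattern that XML Schema 1.0 gives for the type. The function name is_xs_language is made up for this example, and the check only covers the shape of the tag, not whether the language code is actually registered.

```python
import re

# Lexical pattern that XML Schema 1.0 specifies for xs:language.
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_xs_language(value):
    """Return True if value has the lexical form of an xs:language tag."""
    return XS_LANGUAGE.fullmatch(value) is not None


for tag in ["en-US", "es", "zh-Hant", "en_US", ""]:
    print(repr(tag), is_xs_language(tag))
```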

cjerdonek commented 9 years ago

@pstenbjorn Thanks. I have two main comments on your proposal.

First, my preference is that the type definition itself include the sequence aspect. This makes the schema simpler and more DRY (i.e. by not having to include the maxOccurs="unbounded" in every usage of the type, but rather just once in the type definition).

Second, I also think it's important that the translations of a particular element be semantically grouped to reflect the structure, as opposed to having everything flattened.

So this--

<subtitle type="multiLang">
    <!-- Translations -->
</subtitle>
<brief type="multiLang">
    <!-- Translations -->
</brief>
<text type="multiLang">
    <!-- Translations -->
</text>

as opposed to this--

<subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<subtitle type="ballotLanguage"></subtitle>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<brief type="ballotLanguage"></brief>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text>
<text type="ballotLanguage"></text>

I think this is conceptually clearer. The grouped approach also has the advantage that if you wanted to allow multiple text elements with the same tag, then you could still do this. For example, for--

<xs:element maxOccurs="unbounded" name="alias" type="multiLang" />

you could do--

<!-- The first alias has an English form and translations. -->
<alias type="multiLang"></alias>
<!-- The second alias has an English form and translations. -->
<alias type="multiLang"></alias>

With the "flattened" approach, the maxOccurs="unbounded" has already been "used up" for the translations, so you wouldn't be able to simply add the maxOccurs attribute as you normally would to allow more elements.
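
To make the grouped structure concrete, here is a short Python sketch that reads two repeated alias elements, each carrying its own translations. The candidate wrapper element and the sample strings are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up sample: two aliases, each grouped with its own translations.
doc = ET.fromstring("""
<candidate>
    <alias>
        <text lang="en">First alias</text>
        <text lang="es">Primer alias</text>
    </alias>
    <alias>
        <text lang="en">Second alias</text>
        <text lang="es">Segundo alias</text>
    </alias>
</candidate>
""")

# One {language: text} dict per alias, in document order.
aliases = [
    {text.get("lang"): text.text for text in alias.findall("text")}
    for alias in doc.findall("alias")
]
print(aliases)
```

Because each alias keeps its translations together, the reader never has to guess which Spanish string belongs to which English one.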

Both of these issues I addressed in my proposal. Otherwise, I'm okay with the suggestion to use xs:language.

cjerdonek commented 9 years ago

Rather than updating my suggestion in the discussion thread, I created a pull request #49 (which also incorporates @pstenbjorn's suggestion to use xs:language).

pkoms commented 9 years ago

In case it's dropped out of the project consciousness, I'd like to add a reminder here that we should hold on merging in any XML changes (e.g., #52) until we've considered the implications for CSV design and CSV -> XML transformation. @nomadaisy currently has point on that.

jungshadow commented 9 years ago

@pkoms Hmm...I didn't think we planned to hold on all changes. If you think there may be issues with certain changes, you should definitely flag those (NB: this is probably one), but the XML dev should constantly be moving forward. If you are going to request a block, please give a general idea of how long you'll need for an assessment.

pkoms commented 9 years ago

Sorry, poor word choice. I meant to say "merging in any XML changes [on multi-language support.]" Completely agreed that the XML should be constantly moving forward.

We should be considering our CSV options this week.

jungshadow commented 9 years ago

@pkoms Ah...lost in translation :) Great! Thanks!

cjerdonek commented 9 years ago

@pkoms @nomadaisy Could one of you describe what's needed on the CSV side for this, as well as any requirements that have to be met? I may be able to propose something or at least offer some suggestions.

pkoms commented 9 years ago

@cjerdonek before I go into any depth, can you let me know how familiar you are with the VIP CSV specification? Knowing that will make it easier to pitch my explanation appropriately.

cjerdonek commented 9 years ago

Thanks, @pkoms.

can you let me know how familiar you are with the VIP CSV specification?

Not at all, honestly. But if you point me to documentation and/or a good sample file or two (if those already exist somewhere), I can bring myself up to speed on those aspects. That way you don't need to explain everything yourself from the beginning. If the main VIP docs have all I need to know, I can just read that.

cjerdonek commented 9 years ago

Okay, it looks like the VIP docs have quite a bit on this. A couple of CSV approaches occur to me. I can describe them briefly and see what you think.

cjerdonek commented 9 years ago

Here are a couple CSV approaches.

Both approaches use "resource files," which in the context of internationalization means that the translations are provided in one or more separate files (CSV files in this case).

In both approaches, for item types that contain internationalized strings, the comma-delimited flat file for an item would contain the internationalized_text_id as the value for the item, and not the actual text. For example, for the following--

<office id="1">
    <name internationalized_text_id="office_mayor">
        <text lang="en">Mayor</text>
        <text lang="es">Alcalde</text>
        <text lang="zh">市長</text>
    </name>
</office>

the CSV would look like--

name,id
office_mayor,1

The text translations would then be in separate resource files (which would be global for the entire XML feed). Two possible approaches for this are as follows.

Approach 1: Multiple resource files.

In this approach, there would be a separate file for each language (with the file suffix indicating the language). For example--

File language_en.csv:

internationalized_text_id,text
office_mayor,Mayor
...

File language_es.csv:

internationalized_text_id,text
office_mayor,Alcalde
...

Approach 2: Single resource file.

In this approach, a single file contains all the languages, with the column headers indicating the language for that column. For example--

File: translations.csv

internationalized_text_id,en,es,zh
office_mayor,Mayor,Alcalde,市長
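
As a sketch of how a consumer might use the single resource file, the following Python reads it into a lookup table keyed by internationalized_text_id. The function name load_translations is hypothetical, and the file is simulated with an in-memory string so the example is self-contained.

```python
import csv
import io


def load_translations(fileobj):
    """Read a single resource file into {text_id: {language: text}}."""
    table = {}
    for row in csv.DictReader(fileobj):
        text_id = row.pop("internationalized_text_id")
        table[text_id] = dict(row)  # Remaining columns are the languages.
    return table


sample = io.StringIO(
    "internationalized_text_id,en,es,zh\n"
    "office_mayor,Mayor,Alcalde,市長\n"
)
table = load_translations(sample)
print(table["office_mayor"]["es"])
```

The value office_mayor found in the office CSV row would then be resolved through this table for whichever language is being displayed.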

Does either of those approaches sound okay?

cjerdonek commented 9 years ago

One advantage of both approaches above is that they are DRY (and so less verbose). The translations of a given string of text occur only once in the entire feed (i.e. in the resource file), as opposed to in every occurrence in the feed.

Also, one advantage to Approach 1 above is that support for additional languages can be provided simply by adding a new CSV file for that language, without having to touch any other part of the feed. Similarly, jurisdictions can add support for a language simply by sending their "English CSV" off to a translation service, and getting back another CSV for the new language.

pkoms commented 9 years ago

Thanks! We had something close to 2 on the table, but 1 is interesting as well. Tagging @nomadaisy just to make sure these get on her radar.

nomadaisy commented 9 years ago

Hi all, I've asked our Dev team, and the easiest thing for them to incorporate would be the single-resource file, like Chris's example 2, with additional columns for the additional languages for that field. I've identified the following fields as display text that should be translated:

ballot_response.txt: text
candidate.txt: party, biography
contest.txt: primary party, electorate_specifications, office
custom_ballot.txt: heading
early_vote_site.txt: directions, voter_services, days_times_open
election.txt: registration_info, absentee_ballot_info
election_official.txt: title
polling_location.txt: directions
referendum.txt: title, subtitle, brief, text, pro_statement, con_statement, passage_threshold, effect_of_abstain
source.txt: description

I'm interested in @kennethmbennett's take on how many fields are stored in the database in other languages. Should we enable translation for display text fields only or include all text fields like precinct.txt: name? Does your system store non-ballot, non-display text fields in other languages?

I also did not include location names, such as polling_location.txt: name and early_vote_site.txt: name. Is it helpful to translate those, or should we leave them as they appear, since they refer to a proper name that might not need to be translated?

cjerdonek commented 9 years ago

@nomadaisy One brief comment re: proper names. Candidate names are one example of a type of proper name provided in other languages, at least in San Francisco. San Francisco provides them in Chinese. I imagine that in cases where the language uses different characters, a translation of a proper name would be possible (though I don't know these languages firsthand).

Second, on a slightly different topic, it might be worth talking about how to choose or generate the internationalized_text_id for a string.

A convention of something like object_type__field_name__id might be a good starting candidate. So for the following, we would have something like--

party.name: party_name_democrat, party_name_republican, etc.
party.description: party_description_democrat, party_description_republican, etc.

If the ID portion could be generated programmatically, that would be even better (e.g. the ID of the parent object).
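
A minimal Python sketch of generating such an ID programmatically follows. The function name make_text_id and the normalization rules (lowercasing, spaces to underscores, single-underscore separator to match the party examples) are assumptions for illustration.

```python
def make_text_id(object_type, field_name, object_id):
    """Build an internationalized_text_id from its three parts."""
    parts = (object_type, field_name, str(object_id))
    # Normalize each part so the result is a stable, lowercase identifier.
    return "_".join(part.strip().lower().replace(" ", "_") for part in parts)


print(make_text_id("party", "name", "democrat"))
print(make_text_id("office", "name", 1))
```

Using the parent object's ID as the last part means no one has to invent or maintain a separate naming list for the translation keys.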

jungshadow commented 9 years ago

@nomadaisy Agree that a single resource file would be easier for the devs on our side, but I'm wondering how difficult it would be for the states to structure the information in that way (i.e. I have a feeling the data isn't linear or in the same system). Adding @Josh-LACRRCC to this conversation to assess since he's the database/ballot/XML expert.

Josh-LACRRCC commented 9 years ago

Here in LA County, we use a contracted vendor for all our translated election materials. We would like to bring the support of various languages into our EMS, but we have not yet begun building specifications for our requirements. We recently produced a bilingual text ballot as a manual process.

A single resource file does cut down on clutter, but operationally, language-specific resource files provide a smoother workflow. Also, as @cjerdonek points out, separate resource files scale better. LA County currently supports a total of eleven languages, with the possibility of adding another three.

Under our current system, we do not translate proper names. Candidate names and location names (whether in dedicated elements like polling_location.txt: name or as a city name within referendum.txt: text) are left in English. If an element is missing or optional in the resource file and a system knows to "default" to English, then both LA County's and San Francisco's policies on candidate names would be covered.
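
The "default to English" behavior could look roughly like the Python sketch below. The function name lookup_text, the table layout, and the sample entries are all made up for illustration; an empty string is treated the same as a missing translation.

```python
def lookup_text(translations, text_id, lang, default_lang="en"):
    """Return the text for lang, falling back to default_lang if absent."""
    row = translations.get(text_id, {})
    value = row.get(lang)
    if value:  # Present and non-empty.
        return value
    return row.get(default_lang)


# Made-up sample: one fully translated string, one left in English only.
translations = {
    "office_mayor": {"en": "Mayor", "es": "Alcalde"},
    "polling_location_name_1": {"en": "City Hall", "es": ""},
}

print(lookup_text(translations, "office_mayor", "es"))
print(lookup_text(translations, "polling_location_name_1", "es"))
```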

There are a couple of items that I would add to @nomadaisy's list of translation elements:

election.txt: name, election_type
party.txt: name

Some of the enumerations would also be relevant.

cjerdonek commented 9 years ago

FWIW, I learned a little about SF's process this week. SF also uses a translation vendor. They have some Excel spreadsheets that are roughly of the form: one word or phrase per row, with different languages in each column. Those spreadsheets currently have three and a half languages (English, Chinese, and Spanish, with Filipino still being worked on). You can see one of the spreadsheets exported to CSV form here. In that same repo, I'm going to be playing around with these files to experiment with different formats (e.g. generating YAML, which seems to be more suitable for editing multi-line strings by hand; JSON; HTML for display, etc).

Also FWIW, on the county side, a script to convert a single-file resource file to multiple files (i.e. one per language) or vice versa should be pretty simple as conversions go. If the files are structured to begin with, I would guess thirty lines of code or so.
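
For a sense of scale, here is a rough Python sketch of the single-file to per-language direction. It assumes the column layout from the examples earlier in this thread (internationalized_text_id followed by one column per language) and works on CSV text in memory; the function name split_by_language is hypothetical.

```python
import csv
import io

ID_COLUMN = "internationalized_text_id"


def split_by_language(single_file):
    """Convert single-file CSV text into {language: per-language CSV text}."""
    reader = csv.DictReader(io.StringIO(single_file))
    languages = [name for name in reader.fieldnames if name != ID_COLUMN]
    buffers = {lang: io.StringIO() for lang in languages}
    writers = {
        lang: csv.writer(buf, lineterminator="\n")
        for lang, buf in buffers.items()
    }
    # Each per-language file gets the same two-column header.
    for writer in writers.values():
        writer.writerow([ID_COLUMN, "text"])
    for row in reader:
        for lang in languages:
            writers[lang].writerow([row[ID_COLUMN], row[lang]])
    return {lang: buf.getvalue() for lang, buf in buffers.items()}


single = (
    "internationalized_text_id,en,es,zh\n"
    "office_mayor,Mayor,Alcalde,市長\n"
)
per_language = split_by_language(single)
print(per_language["es"])
```

Writing each returned string to a file named after its language key would give the multiple-file layout; the reverse direction is a similar loop that merges rows by ID.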

cjerdonek commented 9 years ago

If you're looking for data to play around with, here are some of the translations I mentioned in the previous comment cleaned up and converted to YAML files, one per language: https://github.com/cjerdonek/sf-base-election-data/tree/master/pre_data/i18n/auto

jungshadow commented 9 years ago

@cjerdonek Safe to close this issue for now? When we get into the implementation, we can reopen or create a new issue for any CSV conflicts.

cjerdonek commented 9 years ago

@cjerdonek Safe to close this issue for now?

Yes, thank you, @jungshadow!

votinginfoproject / vip-specification

Support providing text field data in multiple languages #39