google / transit

https://gtfs.org/
Apache License 2.0

Add new optional field for Text-to-Speech functionality in Service Alerts feed: tts_ivr #132

Closed ibi-group-team closed 5 years ago

ibi-group-team commented 5 years ago

Many feed producers including transit agencies deploy Interactive Voice Response (IVR) systems with Text-to-Speech (TTS) technology that allows their customers to dial a number and receive information including service alerts in audio format. This is particularly useful for transit customers who do not have access to smartphones or computers, or who do not have the ability to read text (due to visual impairments or illiteracy).

The current specification for Service Alerts provides useful options for disseminating and summarizing alert information in text format via the header_text and description_text fields, and allows producers and consumers of the feed to disseminate this information in specific languages by incorporating the TranslatedString data type. However, this is not compatible with IVR/TTS systems because there is no field to disseminate text that is properly formatted for input into a TTS generator. For example, a feed producer may want to publish text with phonetic spelling if the TTS generator would otherwise mispronounce the word, or a feed producer may want to add spaces between numbers if the intent is for the TTS generator to pronounce each number (for example, “one two three” instead of “one hundred twenty three”).
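The digit-spacing case described above can be sketched in a few lines of Python. This is only an illustration of the kind of preprocessing a producer might apply before publishing TTS text; the helper name `space_out_digits` is hypothetical and not part of any GTFS tooling, and phonetic respelling of words would still need manual input.

```python
import re


def space_out_digits(text: str) -> str:
    """Insert a space between adjacent digits so a TTS engine reads them one by one.

    Hypothetical helper for illustration only.
    """
    # Zero-width match at every position that sits between two digits.
    return re.sub(r"(?<=\d)(?=\d)", " ", text)


print(space_out_digits("Stop 123 is closed."))  # Stop 1 2 3 is closed.
```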

To solve this issue, we propose adding a field called “tts_ivr” that contains a text description of the alert formatted for input into a TTS generator. The tts_ivr field would be optional and of the TranslatedString data type.

Incorporating the tts_ivr field in the specification will enable producers and consumers to better support IVR/TTS systems.

This concept can be extended beyond IVR, and can be useful for accessibility in public address systems, virtual assistants and other applications. For example, a parallel field providing TTS text could be added so that consumers can disseminate alert information via public address systems located at transit stops and stations. Such a field could be titled “tts_pa.” Other examples are TTS-compatible fields for the header_text and description_text fields (i.e. tts_header_text, tts_description_text).

skinkie commented 5 years ago

In GTFS-RT we are exchanging content. Could you elaborate on why TTS/IVR (a specific implementation) would justify exchanging style?

ibi-group-team commented 5 years ago

@skinkie in this case, it can be argued that we are proposing to provide content (not style) for text-to-speech tools. The format provided in the proposed tts_ivr field is required so that IVR systems can ingest it and provide alerts in audio format.

skinkie commented 5 years ago

The rendering of text to phonemes is something that is similar to a prediction. A consumer can differentiate between the quality based on tooling. I am worried that this does not work for all use cases in GTFS(-RT) specifically with translations.

abyrd commented 5 years ago

Two comments: First, I don't see a clear-cut delineation between content and style here. Depending on the consuming system, the rendering of the information may need to be visual or auditory. There are valid cases where the information would need to be encoded as either graphemes or phonemes for accurate rendering. Second, I don't believe there is any requirement that GTFS exchange only content and not style or rendering information. It already contains fields specifying route colors for example, and no one is suggesting the use of external formats and software modules to handle route coloration.

About the original proposal: I don't understand why there would be a distinction between the text sent to an IVR and sent to a PA system (tts_ivr versus tts_pa). I would prefer to stick to the general approach of prefixing existing field names with tts_ to provide speakable equivalents that could be used in any context.

LeoFrachet commented 5 years ago

If I follow your thoughts, Andrew (@abyrd), you are suggesting altering the proposal by making just one text-to-speech field, tts_description_text, which would be the text-to-speech version of description_text.

It would partially fit the need of the original proposal, allowing a producer to write "123" as "1 2 3" if it should be read as "one two three" (or even directly as "one two three").

@ibi-group-team: Would this fit your need? It is less implementation specific than IVR, but matches the current discussion (and conclusion) for tts_stop_name (see here).

ibi-group-team commented 5 years ago

Our proposal aims to address two issues:

  1. There are fields that need to be read out, in which the phonetic spelling needs to be provided. In this case, prefixing existing fields with tts_ (such as tts_description_text) would suffice.
  2. Different output channels have different constraints, in terms of both content and length, which need to be customizable by the consumer. In our experience, agencies require certain combinations of the data for different output channels. For example, a PA system may require only the header_text information (and depending on the PA system may require the producer to shorten the header_text field to meet a length constraint), while an IVR system may require both the header_text and description_text information. In this case, prefixing existing fields with tts_ would not suffice, and new fields are needed.

Our proposal of tts_ivr would address both these issues. We understand that the tts_ivr field may not be useful or applicable to all consumers and producers (even when proposed as an optional field), and so we understand if this proposal is not approved. In that case, we are ok with requesting an extension for this field.

LeoFrachet commented 5 years ago

I understand the logic of requesting an extension for tts_ivr, and I understand why IBI would rather have a private tts_ivr that fits 100% of your needs than standardized tts_header_text and tts_description_text fields that would fit 90% of your needs and require some code on both sides.

But I must confess that at an industry level (and IMHO in the long run even for IBI), building these kinds of private extensions may not be the most cost-effective approach. If tomorrow we add the tts_header_text and tts_description_text fields to the official spec, will you provide both the official and the private fields? You likely will have to, for backward-compatibility reasons. And down the road, I'm unsure that making those private extensions will be really cost-effective.

That being said, I'm obviously pushing for a standardized approach, and I fully understand that you have constraints that I'm not aware of, which likely tilt the balance in the tts_ivr direction. I just want to be sure you've seen what will likely happen in the next few years.

(Side note: if the need is about the content, and what you need is a "short" description text, i.e. a description text with a maximum length that summarizes the alert and could easily be read aloud or pushed as a notification, I think there might be an industry need for that. Maybe summary_text and tts_summary_text could be discussed.)

barbeau commented 5 years ago

I agree with @LeoFrachet here - it would be great if we could agree upon a set of TTS fields for GTFS-rt alerts that would meet the majority of cases.

> Different output channels have different constraints, in terms of both content and length, which need to be customizable by the consumer. In our experience, agencies require certain combinations of the data for different output channels. For example, a PA system may require only the header_text information (and depending on the PA system may require the producer to shorten the header_text field to meet a length constraint), while an IVR system may require both the header_text and description_text information. In this case, prefixing existing fields with tts_ would not suffice, and new fields are needed.

I understand this requirement - but this same problem exists for printed text for agencies that want to disseminate alerts via email, web banners, text message, tweets, etc. Yet we only have a single set of fields in GTFS-rt alerts for text descriptions that seems to work reasonably well.

That being said, as Leo mentions, maybe we could find a set of fields that would work and that have a maximum length or duration associated with them? In other words, if the main difference between IVR and PA is duration, then having fields scoped for duration/length that aren't specific to a technology should also work.

LeoFrachet commented 5 years ago

AFAIR, IBI already provides an extra field for MBTA alerts, which is "short content". Does the tts_ivr content match this already existing extra field?

I think I already know of consumers who would be interested in seeing this field standardized, for push-notification purposes.

ibi-group-team commented 5 years ago

@LeoFrachet you are referring to the fields short_header_text and service_effect_text, which are shortened versions of the header and description text fields. Neither of these fields provides the content we are looking for in the tts_ivr field.

It seems that this field may only be applicable to IBI. We will open a pull request with a full proposal and call for a vote soon.

abyrd commented 5 years ago

I wouldn't say that the field is "only applicable to IBI"; the specific proposal is just too tightly coupled to the needs of a particular consumer to be part of a general data format specification. The ideal would be to define fields with characteristics that fit a broad range of use cases, such that IBI's use case falls within the broad definitions of those new fields.

I'm simply hoping we can achieve consistency across the GTFS family of formats. Allowing tts_ variants of existing text fields throughout the specs seems like a good way to do this.

> For example, a PA system may require only the header_text information (and depending on the PA system may require the producer to shorten the header_text field to meet a length constraint), while an IVR system may require both the header_text and description_text information. In this case, prefixing existing fields with tts_ would not suffice, and new fields are needed.

Can you explain why the existing fields would not suffice in this case and why new fields are needed? It seems clear that according to these rules, the PA system would read tts_header_text, and the IVR would read tts_header_text followed by tts_description_text, and the producer would design the content of those fields such that it would sound coherent when they were read in this way. If there is a hard requirement from the primary feed consumer that PA announcements be less than a certain length, then an additional local rule would be applied to keep tts_header_text under that length, even if it differs slightly from the main header_text. This just requires an agreement between primary producer and primary consumer, as part of the process for adapting the text field for a speech system. Local rules can restrict the content of fields and assure that they can be used in a predetermined way, while remaining within the broader definitions in the main spec.
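The consumption rule described here can be sketched as follows. This is a simplified illustration under stated assumptions: `AlertText` is a hypothetical single-language stand-in (real feeds carry TranslatedString values), the fallback to the plain text fields when a TTS field is absent is an assumption, and `max_chars` models a local producer/consumer agreement rather than anything in the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertText:
    """Hypothetical, single-language stand-in for the text fields of an alert."""
    header_text: str
    description_text: str
    tts_header_text: Optional[str] = None
    tts_description_text: Optional[str] = None


def spoken_text(alert: AlertText, channel: str, max_chars: Optional[int] = None) -> str:
    """Pick the text a speech channel would read out.

    "pa"  reads only the TTS header.
    "ivr" reads the TTS header followed by the TTS description.
    Falling back to the plain fields is an assumption made for this sketch.
    """
    header = alert.tts_header_text or alert.header_text
    if channel == "pa":
        text = header
    elif channel == "ivr":
        description = alert.tts_description_text or alert.description_text
        text = f"{header} {description}"
    else:
        raise ValueError(f"unknown channel: {channel!r}")

    # Local rule agreed between primary producer and consumer, not a spec rule.
    if max_chars is not None and len(text) > max_chars:
        raise ValueError(f"{channel} text exceeds the local limit of {max_chars} characters")
    return text


alert = AlertText(
    header_text="Route 123 is delayed.",
    description_text="Expect delays of up to 10 minutes.",
    tts_header_text="Route 1 2 3 is delayed.",
)
print(spoken_text(alert, "pa"))
print(spoken_text(alert, "ivr"))
```

Raising an error when the limit is exceeded, rather than truncating, reflects the point that shortening is an editorial decision for the producer and not something the consumer should do silently.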

abyrd commented 5 years ago

@barbeau said:

> this same problem exists for printed text for agencies that want to disseminate alerts via email, web banners, text message, tweets, etc. Yet we only have a single set of fields in GTFS-rt alerts for text descriptions that seems to work reasonably well.

I completely agree: I don't see any reason why length restrictions are more relevant on spoken fields than on visually rendered text. The latter is used most frequently and needs to be wedged into all sorts of different shapes and sizes.

Variable text length seems like a general concern, not something specific to IVR. Just as we could prefix all text fields with tts_, we could also prefix all text fields with short_ and even combine the two into short_tts_. For these kinds of blanket variations in text content, we should consider introducing such generative rules for new optional field names, rather than adding them one at a time as special features.
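A generative naming rule of this kind could look roughly like the sketch below. The base field names come from the existing Service Alerts text fields and the prefixes from this comment; the function `variant_field_names` and the idea of enumerating the variants programmatically are illustrative assumptions, not anything defined in the spec.

```python
from typing import Dict, List, Sequence

# Existing alert text fields discussed in this thread.
BASE_TEXT_FIELDS = ("header_text", "description_text")

# The empty prefix stands for the original field; the others are the
# optional variants suggested in the comment above.
PREFIXES = ("", "tts_", "short_", "short_tts_")


def variant_field_names(
    base_fields: Sequence[str] = BASE_TEXT_FIELDS,
    prefixes: Sequence[str] = PREFIXES,
) -> Dict[str, List[str]]:
    """Map each base text field to the names of its prefixed variants."""
    return {base: [prefix + base for prefix in prefixes] for base in base_fields}


for base, variants in variant_field_names().items():
    print(base, "->", ", ".join(variants))
```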

ibi-group-team commented 5 years ago

We see the argument for adding existing text fields prefixed with tts_ or variations of that (short_ and short_tts_). It’s true that agencies strive for consistency across output channels, but in some cases (such as this one) they need to edit text for each output channel. For example, programmatically shortening messages to meet a character count restriction may produce results that need to be edited manually, Twitter has a character count restriction and posts may include hashtags, and website banner text may include all-caps formatting. Prefixing existing text fields with tts_ will not handle all the needs of systems such as IVR. Each system requires input with different content, length, and structure, targeted towards different audiences. Producing this input programmatically is a complicated procedure that may not produce accurate results in all cases, and it is better handled using manual text input.

barbeau commented 5 years ago

@ibi-group-team does IVR have any restrictions on content other than length? For example, why wouldn't a tts_*_short field with a hard character limit work?

ibi-group-team commented 5 years ago

@barbeau yes IVR has restrictions on content for more human-friendly pronunciations. For example, stop numbers will be spelled out with spaces in between, i.e. stop 1 2 3 (one two three), NOT stop 123 (one hundred twenty three).

barbeau commented 5 years ago

> yes IVR has restrictions on content for more human-friendly pronunciations. For example, stop numbers will be spelled out with spaces in between, i.e. stop 1 2 3 (one two three), NOT stop 123 (one hundred twenty three).

@ibi-group-team Sorry, maybe I wasn't clear. Generally, in any TTS field I'd expect stop numbers to be spelled out phonetically and represent how they are referred to in the vernacular for that agency when used in spoken communication. For stop 123 I'd think the typical TTS field would contain one twenty three (although if riders typically refer to it as one two three that's also certainly valid). 6123 would be sixty one twenty three.

So, assuming this phonetic content is the same for all TTS type fields, my question was "is there anything specific to the TTS IVR content other than length when compared to other TTS dissemination methods" (e.g., mobile app accessibility)?

If so, then having a more general tts_short field instead of the very specific tts_ivr field would work, as long as there is a length restriction placed on that field.

abyrd commented 5 years ago

@ibi-group-team, it does not seem to me that these restrictions on IVR text would require an additional field.

First, in line with @barbeau's comment, the purpose of the TTS fields is to write out words and numbers as they would be pronounced to assist a text to speech system. This is the same as one of the needs you have described for IVR.

Second, if for specific local uses, limitations exist on the length of text-to-speech strings, nothing would prevent you from further restricting the content of these TTS fields. Making the content shorter would still yield content in line with the definition of TTS fields. A simple agreement between primary producer and consumer would allow limiting this text length, while the GTFS standard would contain a somewhat less restrictive definition, which would still be met by the somewhat more restrictive local rules.

So in short: does the producer really want to have separate general-purpose TTS content and single-purpose IVR content, of different lengths? If not, can't a general TTS field contain the IVR text, fitting the local use case but also yielding a general-purpose field for use by all other RT producers? If so, can't general tts_x and tts_short_x fields also provide these long and short variants?

ibi-group-team commented 5 years ago

@barbeau @abyrd we agree with your point that general tts_ and tts_short_ fields instead of a specific tts_ivr field would work, and are open to updating our proposal to reflect that. However, one thing to consider is that these tts_/tts_short_ fields may have different content than their corresponding text fields. tts_ fields will need to be editable to allow users to customize them for any text-to-speech implementation, thereby allowing users to change their core content. Do you see this as a problem?

barbeau commented 5 years ago

@ibi-group-team As long as the normal text and tts fields reflect the same overall information this is fine - I wouldn't expect the text and tts fields to have the exact same content, as the tts fields need to be customized so the TTS engine can pronounce the word correctly. Does this answer your question? If not, could you give examples of how the normal text and tts fields would differ?

abyrd commented 5 years ago

@ibi-group-team can you clarify what you mean by "fields will need to be editable"? The term "editable" seems most relevant in an interactive application for creating alerts (rather than a backend data pipeline component), and in such an application I would expect every field to be editable. Any fields with exactly the same content would be redundant, so all fields should be independently changeable.

When you say "users" can change their core content, again I will assume you mean users of an alert creation system, who are publishing the feed. In that case yes, these fields can definitely contain different text as long as it's intended to convey the same meaning. But does changing core content mean altering the meaning significantly? I don't think any field within a single alert should carry information about a fundamentally different event or disruption than the other fields.

But if by "change their core content" you just mean "express the same information differently so it will be read correctly by the text-to-speech system used by the most common consumers of the feed" then yes, absolutely. That would be the whole purpose of these fields.

ibi-group-team commented 5 years ago

@abyrd yes we are referring to users of an alert creation/management system and to fields that are populated and edited using that system.

Users who enter alerts will have the ability to edit the content of the TTS fields, which means a user could theoretically edit the TTS message in a way that the content/message differs from the main text field in ways other than just pronunciation changes. We do not expect that users will change the meaning of the TTS fields to be significantly different than their corresponding text fields, but it should still be considered a possibility. For example, in a case where header_text = "Route 123 is delayed due to construction.", tts_header_text could be = "Route 1 2 3 is delayed due to construction. Please use Route 4 5 6 instead." We would expect users to include the phrase 'Please use Route 456' in the header_text, but since all fields are editable it may not always happen.
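An alert management tool could flag this kind of divergence with a rough heuristic like the one below. It is only a sketch: the function names are hypothetical, it only undoes digit spacing (phonetic respellings would still be reported as differences), and it would wrongly merge genuinely separate single digits written side by side.

```python
import re


def collapse_digit_spacing(text: str) -> str:
    """Undo digit spacing, e.g. "Route 1 2 3" becomes "Route 123"."""
    return re.sub(r"(?<=\d) (?=\d)", "", text)


def differs_beyond_digit_spacing(text: str, tts_text: str) -> bool:
    """Return True if the TTS text differs from the plain text in more than digit spacing.

    Heuristic for illustration only.
    """
    return collapse_digit_spacing(tts_text) != collapse_digit_spacing(text)


header_text = "Route 123 is delayed due to construction."
tts_same = "Route 1 2 3 is delayed due to construction."
tts_extra = "Route 1 2 3 is delayed due to construction. Please use Route 4 5 6 instead."

print(differs_beyond_digit_spacing(header_text, tts_same))   # False
print(differs_beyond_digit_spacing(header_text, tts_extra))  # True
```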

We will be updating our proposal to replace the tts_ivr field with two optional fields, tts_header_text and tts_description_text.

LeoFrachet commented 5 years ago

Sounds good!

@ibi-group-team The difference in content you describe between header_text and tts_header_text is consistent with what I've seen between stop_name and tts_stop_name: the agency changes the name of the stop when it's read out loud on the bus, for many different reasons. I think it can make sense, as long as (as Andrew @abyrd said) it doesn't change the core content.

Since we seem to agree on the solution, I assume the conversation will continue on the proposal page: https://github.com/google/transit/pull/135

@barbeau @abyrd & @ibi-group-team: may I close this issue?

barbeau commented 5 years ago

@LeoFrachet I'd leave this open until #135 is closed/merged. Ideally the proposal commit title will reference this issue and then merging (hopefully) the commit will automatically close this issue.

LeoFrachet commented 5 years ago

The proposal has been adopted, so I'm closing the issue.