google / transit

https://gtfs.org/
Apache License 2.0

Clarification on language code data standards used in translations.txt #435

Open julieteldred opened 7 months ago

julieteldred commented 7 months ago

Introduce yourself

Juliet Eldred, Project Manager, Trillium/Optibus

Ask a question

Trillium has been producing translations.txt files for a client's project. The translations data was ingested into the Transit App without any apparent issues, but when my colleague tried to submit the feed with a translations.txt file to Google Transit, Google's validator returned a "Value of the language column from the translations.txt file" error.

I took a closer look at the spec. Under the feed_info section (see feed_lang), it says to use ISO 639-2 codes for the language (which are 3-letter codes, e.g. "spa" for Spanish). This is what I used for all the language codes, as I was under the impression that it was the standard for every language code field.

However, upon further investigation, I believe I was wrong, and this is why the feed is returning errors in Google's validator.

In short, I'd like to request a few things:

  1. Could someone clarify which specific international standard language code to use for each particular field in translations.txt (e.g. ISO 639-2 or IETF BCP 47) so that feeds with translations.txt files can get through the validator?
  2. Since the documentation of this on GTFS.org is pretty vague, would it be possible to update the documentation to be more explicit about the specific language code standard required?

Please don't hesitate to let me know if you have any questions or clarifications - my email is juliet@trilliumtransit.com

eliasmbd commented 7 months ago

Thank you for pointing out an inconsistency in the GTFS documentation in your first issue! It’s contributions like yours that help us make the spec more accessible. :rocket:

Sergiodero commented 6 months ago

Hi Juliet!

Thank you for raising these questions and pointing out that source of confusion in such a clear and guided manner. This is always very much appreciated! Based on a quick review of the spec text, our interpretation is that all language code fields in the spec (agency.agency_lang, translations.language, feed_info.feed_lang and feed_info.default_lang) should use the IETF BCP 47 standard, as stated in the Field Types section at the top of the Reference document.

The mention of the ISO 639-2 standard seems to refer only to the use of the mul code in feed_info.feed_lang, which is applicable in case of feeds containing information in multiple languages within the same dataset (i.e. translations aside), see PR#180 for further context.

This seems to be the main source of confusion. Perhaps a simple solution would be to remove the mention of the ISO 639-2 standard from the description of feed_info.feed_lang, leaving the use of the mul code intact. Alternatively, the IETF BCP 47 standard could be mentioned in each language code field to provide additional clarity, although it could be argued that this would be redundant given that it is already specified in Field Types.
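To make the difference concrete, here is a minimal Python sketch of how a producer might spot ISO 639-2 style values in the language column of translations.txt before submitting a feed. It is only an illustration under stated assumptions: the lookup table and the function names (`suggest_bcp47`, `find_iso_639_2_codes`) are hypothetical, the table covers just a handful of languages, and it is not how Google's validator or any other validator actually works. The mul code is deliberately left out of the table, so it would not be flagged.

```python
import csv
import io

# Hypothetical, deliberately tiny lookup: ISO 639-2 three-letter codes that
# have a two-letter equivalent, which is the form expected in a BCP 47 tag.
ISO_639_2_TO_BCP47 = {
    "spa": "es",
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
}


def suggest_bcp47(code):
    """Return a suggested tag if `code` is a known ISO 639-2 code, else None."""
    return ISO_639_2_TO_BCP47.get(code.strip().lower())


def find_iso_639_2_codes(translations_csv):
    """Yield (line_number, value, suggestion) for suspicious `language` values."""
    reader = csv.DictReader(io.StringIO(translations_csv))
    for row in reader:
        # Treat a missing or empty cell as an empty string.
        value = row.get("language") or ""
        suggestion = suggest_bcp47(value)
        if suggestion is not None:
            yield reader.line_num, value, suggestion


# Made-up sample rows, using column names from translations.txt.
sample = (
    "table_name,field_name,language,translation,record_id\n"
    "stops,stop_name,spa,Estación Central,S1\n"
    "stops,stop_name,fr-CA,Gare Centrale,S1\n"
)

for line_num, value, suggestion in find_iso_639_2_codes(sample):
    print(f"line {line_num}: '{value}' looks like ISO 639-2; consider '{suggestion}'")
```

In this sample, only the "spa" row is reported, while "fr-CA" passes through untouched. A real check would need the full IANA language subtag registry rather than a hand-written table.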

I wonder whether other Producers/Consumers have been interpreting this in the same way, and which language tags they usually rely on when writing GTFS and/or in their validators.