unfoldingWord-box3 / td

Gateway Language Database Prototype
https://td.readthedocs.io/
MIT License
0 stars 1 forks source link

translationDatabase

See http://td.rtfd.io/, this repository contains the source files.

If you want to suggest a change, please fork this repo and create a PR, or create an Issue.

Building

Note there is now a data preparation step; see below

Install the sphinx engine

sudo apt-get install python-sphinx

Then run the build script.

NOTE: this top level makefile is just a shorcut to building the html.

make

NOTE: For region approach, conf.py changed to include:

import sys; sys.setrecursionlimit(3200)

Data Preparation

In the data folder are two JSON files which must be refreshed manually at present. These two JSON files are inputs bin/dataprep.py.

During development there is debug code which restricts actions to only a few languages. This is avoid creation of 8K+ directories.

Stats

Some analysis of input language data is available using the stats script:

=============== Summary ===============
        Total number of languages: 8703
          Total number of regions: 6
Total number of gateway languages: 44

===== Languages per Region =====
Africa:2331
Americas:1157
Asia:3142
Europe:420
Pacific:1623
UNKNOWN:30

===== Languages per G/L =====
UNKNOWN:594
am:82
ar:319
as:36
bi:18
bn:30
ceb:59
en:1658
es:21
es-419:605
fa:57
fr:1057
gu:15
ha:1
hi:130
id:1035
ilo:76
km:21
kn:51
lo:59
ml:27
mn:3
mr:29
ms:129
my:105
ne:104
or:18
pa:3
plt:12
pmy:8
ps:31
pt:267
pt-br:5
ru:132
sw:176
ta:21
te:68
th:34
tl:20
tpi:1088
tr:11
ur:60
vi:84
zh:295
$

Miscellaneous

Here is the DDL for the ISO mapping data:

CREATE TABLE [ISO_639-3] (
         Id      char(3) NOT NULL,  -- The three-letter 639-3 identifier
         Part2B  char(3) NULL,      -- Equivalent 639-2 identifier of the bibliographic applications 
                                    -- code set, if there is one
         Part2T  char(3) NULL,      -- Equivalent 639-2 identifier of the terminology applications code 
                                    -- set, if there is one
         Part1   char(2) NULL,      -- Equivalent 639-1 identifier, if there is one    
         Scope   char(1) NOT NULL,  -- I(ndividual), M(acrolanguage), S(pecial)
         Type    char(1) NOT NULL,  -- A(ncient), C(onstructed),  
                                    -- E(xtinct), H(istorical), L(iving), S(pecial)
         Ref_Name   varchar(150) NOT NULL,   -- Reference language name 
         Comment    varchar(150) NULL)       -- Comment relating to one or more of the columns   

Data Quality

For every gateway language called out by a language, this checks to see it is really is a gateway language based on the gateway flag for the language.

The query below returns all languages that are referenced as a gateway language but are not actually gateway languages. As of this writing there are 31 such invalid gateway languages.

select id,code,name,gateway_flag
from uw_language 
where id in (
    select gateway_language_id 
    from uw_language 
    where gateway_language_id is not null
)
and gateway_flag = false
order by code