laurestine/needandambiguity

These are the files relating to the report "Communicative need and lexical ambiguity in semantic domains across languages".

For some of the scripts to work, the working directory must be set to the directory containing the script. The clics.sqlite file, downloadable from https://github.com/clics/clics3, must be in that same directory. To run the code that finds frequencies in Swahili, the two news subcorpora of the Helsinki Corpus of Swahili 2.0 must be in a folder called "Helsinki Corpus of Swahili 2" and named "hcs2_new_news.vrt" and "hcs2_old_news.vrt".
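As a convenience, the requirements above can be checked before running anything. The following is a minimal sketch in Python, which is not necessarily the language of the scripts in this repository. The function name `missing_inputs` and its parameters are hypothetical, and the sketch assumes that the "Helsinki Corpus of Swahili 2" folder sits inside the script directory, which this README does not state explicitly.

```python
from pathlib import Path

# File required by some scripts, relative to the script directory.
CLICS_DB = Path("clics.sqlite")

# Files required only for the Swahili frequency code.
# Assumption: the corpus folder is located inside the script directory.
SWAHILI_DIR = Path("Helsinki Corpus of Swahili 2")
SWAHILI_FILES = [
    SWAHILI_DIR / "hcs2_new_news.vrt",
    SWAHILI_DIR / "hcs2_old_news.vrt",
]


def missing_inputs(script_dir, need_swahili=True):
    """Return the expected input files that are absent from script_dir.

    Paths in the result are relative to script_dir. An empty list means
    every expected file was found.
    """
    base = Path(script_dir)
    wanted = [CLICS_DB]
    if need_swahili:
        wanted = wanted + SWAHILI_FILES
    return [rel for rel in wanted if not (base / rel).is_file()]
```

Calling `missing_inputs(".")` from the script directory returns an empty list when everything is in place. Passing `need_swahili=False` skips the corpus check for scripts that do not touch the Swahili data.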

The documents whose names start with only a digit are initial files. They were either downloaded from Concepticon (conceptrelations.tsv and annotatedconceptlist.csv), extracted from CLICS3 (clicslanguagecodes.csv), or manually created (the list of sources used to choose domains, as well as the domain annotations on annotatedconceptlist.csv). Note that Concepticon and CLICS3 are both distributed under the Creative Commons Attribution 4.0 International License, which allows re-publishing and building on their data.

The documents whose names start with A are the scripts used to compute ambiguity scores for each semantic domain, together with the output files from each of those scripts.

The documents whose names start with B are the scripts used to compute frequencies for each semantic domain, together with the output files from each of those scripts.

The documents whose names start with C are the script used to count how many concepts from each domain were lexified in CLICS in each language (to filter out domain-language pairs with sparse data) and the resulting data table.

Finally, the documents whose names start with D are the script used to create figures and statistics for the report, together with the figures it creates.
