Project URL
https://pypi.org/project/samba_sampler
Does this project already exist?
New Limit
175 MB
Update issue title
Which indexes
PyPI, TestPyPI
About the project
samba_sampler
is an ongoing project for providing better sampling methods in general, but particularly designed for linguistic typology. It attempts to address issues of vertical and spatial autocorrelation (i.e., Galton's problem [1]), has already been used by two Ph.D. students at Uppsala University (Sweden), and will be submitted for peer-reviewed publication in about a month.

[1] https://en.wikipedia.org/wiki/Galton%27s_problem
Reasons for the request
To keep installation simple (the target audience is not necessarily proficient with computers), I have decided to distribute the package with all the data needed for normal usage, namely: (a) a custom dump of Glottolog's [2] data, (b) a pre-computed matrix of phylogenetic distances from GLED's "world tree" [3], (c) a pre-computed matrix of Haversine distances, and (d) a pre-computed matrix of walking distances, adapted from Guzman Naranjo & Jäger (2023) [4]. Distributing this data is important both for computation speed and for ease of access. The matrices are square matrices covering over 8,000 different language varieties, all with geographic coordinates.
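As a rough illustration of what one cell of the pre-computed Haversine matrix encodes, here is a minimal sketch of the standard Haversine formula (the function name and the Earth-radius constant are my own illustrative choices, not the package's API):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))
```

Pre-computing all ~8,000 × 8,000 such pairwise distances once, rather than on every run, is what motivates shipping the matrices with the package.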
To reduce the package size, I have implemented a custom class [5] that stores the data in Python's `array` arrays instead of pickled lists, with the datatype set to unsigned integers; the files are also bzip2-compressed at the highest compression level.
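A minimal sketch of this storage idea (the function names and exact on-disk layout here are illustrative assumptions, not the actual implementation in `common.py`):

```python
import bz2
from array import array

def save_matrix(values, path):
    """Store a flattened matrix of unsigned integers, bzip2-compressed."""
    arr = array("I", values)  # "I" = unsigned int; far more compact than a pickled list
    with bz2.open(path, "wb", compresslevel=9) as fh:  # 9 = highest compression level
        fh.write(arr.tobytes())

def load_matrix(path):
    """Inverse of save_matrix: decompress the file and rebuild the array."""
    arr = array("I")
    with bz2.open(path, "rb") as fh:
        arr.frombytes(fh.read())
    return arr
```

Because distances are stored as raw unsigned integers rather than Python objects, the serialized form is a dense byte buffer that bzip2 can compress effectively.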
Unfortunately, even with these measures I am now over the 100 MB limit. I am requesting a new limit of 175 MB, which should be more than enough to fit the package once all the matrices are integrated (I estimate the final size will be about 130 MB).
[2] https://www.glottolog.org
[3] https://doi.org/10.5281/zenodo.7368116
[4] https://doi.org/10.12688/openreseurope.16141.1
[5] https://github.com/tresoldi/samba_sampler/blob/main/src/samba_sampler/common.py
Code of Conduct