The childes-db project aims to make CHILDES transcripts more accessible by reducing the amount of preprocessing (e.g., CLAN or specific preprocessing libraries) and by making the individual tokens, utterances, transcripts, and corpora available in a tidy, tabular format that is accessible across programming languages. We release new versions of this dataset periodically to facilitate reproducibility. We also provide an R package (childesr) and a Python package (childespy) which allow users to access this database without having to write complex SQL queries.
The following instructions should work to generate a new release of childes-db
on an Ubuntu 16 - 20 machine
cd
into it$ROOT_PASS
in ~/.profile
and source it. Check with echo $ROOT_PASS
virtualenv -p python3 childes-db-py3
source childes-db-py3/bin/activate
pip install -r requirements.txt
Download corpora from the CHIILDES and PhonBank sites, both XML and CHAT files with the following Python script:
python3 download_corpora.py --versions_root /shared_hd2/childes-db --version <version name, e.g. 2020.2>
The following are necessary only if installing MySQL new
Edit the mysqld.cnf file to make MySQL accessible externally and so that it can handle large queries and long import times better
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
change bind address to 0.0.0.0 to allow access from outside the host
bind-address = 0.0.0.0
Increase the max_allowed_packed size
max_allowed_packet = 256M
Increase the default timeout
wait_timeout=6000
Restart MySQL for changes to take effect
sudo service mysql restart
mysql -u<mysql username> -p<mysql password>
# should start MySQL prompt
create user 'childesdb'@'%' IDENTIFIED BY '<childes db default password>'
\q
Update the config.json
file to point to the name of the database you want to populate. This is necessary in addition to
python3 manage.py makemigrations db
./new_2020.sh
This makes a new database, runs migrate to to initialize the schema, and runs the populate_db
command. As of the 2020 release, this calls the populate_db
command twice internally: once to add datasets from CHILDES, and once to add datasets from Phonbank
To compute the correlation between the Transcript-Speaker counts as computed with childes-db vs. that computed with CLAN:
python3 manage.py test_frequency
If this number is less than .997, check why. Note that this aggregates by speaker role within each transcript, so it won't be able to detect when there is a problem with reliably identifying a child across different transcripts (see issue #61).
To check how the contents of one childes-db database differ from another, we have a Django management command:
python3 manage.py compare_databases --old_db_name 2020.1 --new_db_name childes_db_dev_test2
.
This will hash each record -- either token or utterance -- for each transcript and check which ones have changed and put CSVs into directories entitled utt_mismatch_csvs
and token_mismatch_csvs
.
To copy from the staging server, e.g., Chompsky, to the production server, e.g., EC2, copy databases to the production server with the following command
ssh -p <ssh port on Chompsky> <user>@<chompsky_host> 'mysqldump -u <chompsky_username> -p<chompsky password> --databases <database version> | gzip -c' | gunzip -c | mysql -u <EC2 MySQL uersname> -p<EC2 MySQL password>
mysql -u<mysql username> -p<mysql password>
# should start MySQL prompt
GRANT SELECT ON `2020.1`.* TO '<mysql read-only user>'@'%' IDENTIFIED BY '<mysql password>'
GRANT SELECT ON `2019.1`.* TO '<mysql read-only user>'@'%' IDENTIFIED BY '<mysql password>'
GRANT SELECT ON `2018.1`.* TO '<mysql read-only user>'@'%' IDENTIFIED BY '<mysql password>'
mysql -u<mysql username> -p<mysql password>
# should start MySQL prompt
create view admin as select * from `2020.1`.`admin`;
create view collection as select * from `2020.1`.`collection`;
create view corpus as select * from childesdb.`corpus`;
create view django_content_type as select * from `2018.1`.`django_content_type`;
create view django_migrations as select * from `2018.1`.`django_migrations`;
create view participant as select * from `2020.1`.`participant`;
create view token as select * from `2020.1`.`token`;
create view `token_frequency` as select * from `2018.1`.`token_frequency`;
create view transcript as select * from `2020.1`.`transcript`;
create view transcript_by_speaker as select * from `2018.1`.`transcript_by_speaker`;
create view utterance as select * from `2020.1`.`utterance`
\q
Give permissions to the the reader to read from the alias
mysql -u<mysql username> -p<mysql password>
# should start MySQL prompt
GRANT SELECT ON `childesdb`.* TO '<mysql read-only user>'@'%' IDENTIFIED BY '<mysql password>'
GRANT SHOW VIEW ON `childesdb`.* TO '<mysql user>'@'%' IDENTIFIED BY '<mysql password>'