This is Phonetoque, a tool for phonetic computation intended to serve language learning apps. What you see here is the data collection pipeline for Phonetoque.
Everything is done through the manager.py script.
python manager.py [-h] [--language LANGUAGE] [-i I] [-o O] [--conf CONF] script
Options:
--language the language of the input and output data
--conf which conf file to use; defaults to /scripts/script_config.yml
-i input file
-o output file
-h help
Available scripts are in /scripts. Which options are required depends on the script.
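For orientation, here is a minimal sketch of an argparse parser that would produce the usage line shown above. It is not the actual contents of manager.py: the function name `build_parser` and the help strings are assumptions, and only the flags and the default conf path are taken from this README.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line (not the real manager.py)."""
    parser = argparse.ArgumentParser(prog="manager.py")
    parser.add_argument("--language", help="language of the input and output data")
    parser.add_argument("-i", help="input file")
    parser.add_argument("-o", help="output file")
    # Default taken from the option list in this README.
    parser.add_argument("--conf", default="/scripts/script_config.yml",
                        help="conf file to use")
    parser.add_argument("script", help="name of a script in /scripts")
    return parser


if __name__ == "__main__":
    # Example argument list; the file names are placeholders.
    args = build_parser().parse_args(
        ["--language", "fr", "-i", "words.txt", "-o", "ipa.txt", "scrape"]
    )
    print(args.script, args.language, args.i, args.o, args.conf)
```

Because the short flags are declared as `-i` and `-o`, argparse stores their values as `args.i` and `args.o`.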
This project also includes a simple Flask API used to communicate with our database (currently an mLab test instance). To run it:
docker-compose up
It should run on 127.0.0.1:5000 in debug mode. Make sure the API is running before you run any script that interacts with the database.
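If you want to check that the API is up before launching a database script, a small TCP probe is one option. This is a sketch and not part of the project: the helper name `api_is_reachable` is hypothetical, and it only confirms that something accepts connections on the port, not that the Flask app itself is healthy.

```python
import socket


def api_is_reachable(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if not api_is_reachable():
        print("API not reachable on 127.0.0.1:5000; try `docker-compose up` first.")
```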
manager.py scrape [-h] --language LANGUAGE -i INPUT_FILE -o OUTPUT_FILE [--conf CONF]
Scrapes wiktionary.org (in the appropriate language) for IPA pronunciations of a list of words (given in the input file, one per line) and writes them to the output file in the following format:
word pronunciation
word
word pronunciation pronunciation pronunciation
...
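The format above can be read back with a few lines of Python. The sketch below assumes fields are separated by whitespace and that neither words nor pronunciations contain spaces; the README does not state the separator, so adjust if the real output differs. The function names and the sample lines are illustrative only.

```python
def parse_scrape_line(line):
    """Split one output line into (word, [pronunciations]); return None for blank lines."""
    fields = line.split()  # assumption: whitespace-separated fields
    if not fields:
        return None
    return fields[0], fields[1:]


def parse_scrape_output(lines):
    """Build a dict mapping each word to its (possibly empty) list of pronunciations."""
    entries = {}
    for line in lines:
        parsed = parse_scrape_line(line)
        if parsed is not None:
            word, pronunciations = parsed
            entries[word] = pronunciations
    return entries


if __name__ == "__main__":
    # Illustrative data, not actual scraper output.
    sample = ["cat /kæt/", "xyzzy", "read /ɹiːd/ /ɹɛd/"]
    print(parse_scrape_output(sample))
```

A word with no pronunciation found maps to an empty list, matching the second line of the format.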
manager.py topatgen [-h] --language LANGUAGE -i INPUT_FILE -o OUTPUT_FILE [--conf CONF]
Processes pronunciation data to find pronunciations that are already broken into syllables, and writes those known syllabified pronunciations in a format that is friendly to pypatgen. TODO: document workflow for pypatgen
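As a rough illustration of the idea, and not the actual topatgen implementation: the sketch below assumes that syllable boundaries in the scraped IPA are marked with the IPA syllable-break character ".", and that the target is a hyphen-separated form of the kind hyphenation-pattern tools commonly take. Both are assumptions; check pypatgen's documentation for the exact input it expects. The function name `to_hyphenated` is hypothetical.

```python
def to_hyphenated(pronunciation, boundary=".", hyphen="-"):
    """Return a hyphen-separated form if the pronunciation is syllabified, else None."""
    # Drop enclosing slashes or brackets around the transcription.
    stripped = pronunciation.strip("/[]")
    if boundary not in stripped:
        return None  # no syllable boundary marked, so nothing known to export
    return stripped.replace(boundary, hyphen)


if __name__ == "__main__":
    # Illustrative transcriptions only.
    for ipa in ["/ˈwɔː.tə/", "/kæt/"]:
        print(ipa, "->", to_hyphenated(ipa))
```

This sketch leaves stress marks untouched; a real pipeline may need to handle them, since they can also coincide with syllable boundaries.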
manager.py post [-h] --language LANGUAGE -i INPUT_FILE [--conf CONF]
Posts word and pronunciation data to the database (via the Flask API). You should have a trained pyphen syllabification dictionary for both your language and its IPA (see TODO above).