About

This is Phonetoque, a tool for phonetic computation intended to serve language-learning apps. This repository contains the data collection pipeline for Phonetoque.

Usage

Everything is done through the manager.py script.

   python manager.py [-h] [--language LANGUAGE] [-i I] [-o O] [--conf CONF] script
   Options:
    --language      language that the input and output data apply to
    --conf          config file to use; defaults to /scripts/script_config.yml
    -i              input file
    -o              output file
    -h              show help and exit

Available scripts are in /scripts. Which options are required depends on the script.

Flask

This project also includes a simple Flask API used to communicate with our database (currently an mLab test instance). To run it:

docker-compose up

It should run on 127.0.0.1:5000 in debug mode. Make sure the API is running before you run any script that interacts with the database.
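As a quick pre-flight check before running a database script, something like the following could confirm that a server is listening on the expected address. This is only a sketch: the helper name api_is_up is hypothetical and not part of this repository, and it only tests that the TCP port accepts connections, not that the Flask API itself is healthy.

```python
import socket


def api_is_up(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper (not part of the repository); the defaults match
    the address mentioned above.
    """
    try:
        # create_connection raises OSError if the port is closed or unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if api_is_up():
        print("API reachable")
    else:
        print("API not reachable; try running docker-compose up first")
```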

Scripts

Scrape

manager.py scrape [-h] --language LANGUAGE -i INPUT_FILE -o OUTPUT_FILE [--conf CONF]

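Scrapes wiktionary.org (in the appropriate language) for IPA pronunciations of a list of words. The input file contains one word per line. Each line of the output file holds a word followed by the pronunciations found for it; a word with no pronunciation found appears alone on its line: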

word pronunciation
word
word pronunciation pronunciation pronunciation
...
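To use this output in another script, a small parser along these lines may help. It is a sketch that assumes fields are separated by whitespace and that no pronunciation contains a space; the function name parse_scrape_output and the sample lines are made up for illustration and are not taken from the repository.

```python
from typing import Dict, Iterable, List


def parse_scrape_output(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each word to its list of pronunciations (possibly empty).

    Assumes whitespace-separated fields: the first is the word,
    the rest are pronunciations.
    """
    result: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, *pronunciations = fields
        result[word] = pronunciations
    return result


if __name__ == "__main__":
    # Illustrative sample lines, not real scraper output.
    sample = [
        "cat /kæt/",
        "xyzzy",
        "tomato /təˈmeɪtoʊ/ /təˈmɑːtəʊ/",
    ]
    for word, prons in parse_scrape_output(sample).items():
        print(word, prons)
```

For a real file, the same function can be called on an open file handle, for example open(path, encoding="utf-8"), since iterating a file yields its lines.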

Topatgen

manager.py topatgen [-h] --language LANGUAGE -i INPUT_FILE -o OUTPUT_FILE [--conf CONF]

Processes pronunciation data to find pronunciations that are already broken into syllables, and outputs these known syllabified pronunciations in a format that is friendly to pypatgen.

TODO: document the pypatgen workflow.
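The idea of picking out already-syllabified pronunciations can be sketched as follows. This is not the topatgen script itself, only an illustration under stated assumptions: a pronunciation counts as syllabified if it contains the IPA syllable break ".", stress marks (which precede a stressed syllable in IPA) are also treated as boundaries, and the hyphen-separated output is a guess at what a pypatgen training dictionary would look like. The function name to_hyphenated is hypothetical.

```python
from typing import Optional

IPA_STRESS_MARKS = ("ˈ", "ˌ")  # primary and secondary stress


def to_hyphenated(pronunciation: str) -> Optional[str]:
    """Return syllables joined by '-' if the pronunciation has '.' breaks, else None."""
    # Drop surrounding /.../ or [...] delimiters.
    body = pronunciation.strip("/[]")
    if "." not in body:
        return None  # no explicit syllable breaks; treat as unsyllabified
    # Assumption: a stress mark also starts a new syllable.
    for mark in IPA_STRESS_MARKS:
        body = body.replace(mark, ".")
    syllables = [s for s in body.split(".") if s]
    return "-".join(syllables)


if __name__ == "__main__":
    for p in ["/təˈmeɪ.toʊ/", "/kæt/"]:
        print(p, "->", to_hyphenated(p))
```

Monosyllabic words never contain a break, so this sketch skips them; whether they should be kept as training data is a design choice left open here.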

Post

manager.py post [-h] --language LANGUAGE -i INPUT_FILE [--conf CONF]

Posts word and pronunciation data to the database (via the Flask API). You should have a trained pyphen syllabification dictionary for both your language and its IPA (see the TODO under Topatgen).
