PMLTQ conversion script

m1ch4ls commented 9 years ago

We need a new, better way to convert treebanks to SQL.

My proposal is to have a single script, pmltq (which we already have), and add admin commands to it. This should simplify things considerably. Right now we have a skeleton directory structure consisting of one config and about four executables. That is difficult to update and difficult to work with when we have about 100 treebanks.

`pmltq convert treebank_config.yml /sql_dir/`

The conversion process is already implemented: see contrib/pml2base.btred, which uses PMLTQ/PML2BASE.pm. It currently runs under btred, but I don't think that's necessary, and it would be great not to depend on btred (if possible). This will require changes in PML2BASE, and some tests would be great too.

`pmltq initdb treebank_config.yml`

Every database needs to import the following SQL files, which are in the shared directory:

init_postgres.sql
pml2base_init-pg.sql

The Oracle SQL is not needed; we no longer support Oracle.

`pmltq load treebank_config.yml /sql_dir/`

The generated files are not actual SQL but CSV-like files. For every layer there is a bash import script. We can keep it that way or not - PMLTQ is Linux-specific anyway.

`pmltq verify treebank_config.yml`

Check that the database exists and contains some data. I don't mean a complete verification, just a quick check so I can see whether the treebank has been imported.
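
A quick check of this kind could look roughly like the sketch below. It is only an illustration in Python, not PMLTQ code: the function name and the table names are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server. The function accepts any DB-API connection.

```python
import sqlite3


def treebank_looks_imported(conn, tables):
    """Return True if every listed table exists and holds at least one row.

    `conn` is any DB-API connection; `tables` is a hypothetical list of
    table names expected for the treebank (not PMLTQ's real naming).
    """
    if not tables:
        return False
    cur = conn.cursor()
    for table in tables:
        try:
            # Cheap existence + non-emptiness probe, no full count.
            cur.execute(f'SELECT 1 FROM "{table}" LIMIT 1')
        except Exception:
            # Missing table (or any other DB error) counts as "not imported".
            return False
        if cur.fetchone() is None:
            return False
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "a_node" (id INTEGER)')
    conn.execute('INSERT INTO "a_node" VALUES (1)')
    print(treebank_looks_imported(conn, ["a_node"]))
```

With a real PostgreSQL driver, a failed statement aborts the current transaction, so a rollback would be needed after the `except` branch before reusing the connection.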

`pmltq delete treebank_config.yml`

Delete the treebank from the database.

`pmltq query [treebank_config.yml]`

The current interface, hidden under the query command, with an optional config.
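
To make the command layout above concrete, here is a minimal sketch of the argument parsing only. It is written in Python purely for illustration (the real pmltq script is Perl), `build_parser` is a hypothetical name, and no command actually does anything here.

```python
import argparse


def build_parser():
    """Build a parser for the subcommands listed in the proposal."""
    parser = argparse.ArgumentParser(prog="pmltq")
    sub = parser.add_subparsers(dest="command", required=True)

    # (name, takes an sql_dir argument) as in the proposal above.
    admin_commands = [
        ("convert", True),
        ("initdb", False),
        ("load", True),
        ("verify", False),
        ("delete", False),
    ]
    for name, needs_sql_dir in admin_commands:
        cmd = sub.add_parser(name)
        cmd.add_argument("config", help="treebank config (YAML)")
        if needs_sql_dir:
            cmd.add_argument("sql_dir", help="directory with generated files")

    # `query` keeps the current interface; its config is optional.
    query = sub.add_parser("query")
    query.add_argument("config", nargs="?", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Keeping the per-command arguments in one table makes it easy to see at a glance which commands read or write the SQL directory.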

Example config file with comments:


---
data_dir: /pmltq/data/dir/ # directory where the data are (this is also base directory for data layers)
resources: /pmltq/resources/ # main directory with PML schemas

db: # typical DB auth stuff
  name: treebank_db_name
  host: localhost
  port: 5432
  user: pmltq
  password: pwd

layers: # description of all data layers
  - name: adata
    data: ./relative/to/data_dir/**/*.a.gz
    resources:
      - extra_resource
    references:
      - [ t-node/val_frame.rf ] # equivalent to t-node/val_frame.rf=-
      - [ t-a/aux.rf, "adata:a-node" ] # t-a/aux.rf=adata:a-node
      - [ t-node/coref_gram.rf, t-node ] # t-node/coref_gram.rf=t-node
  - name: tdata
    data: "**/*.t.gz" # quoted, because a plain scalar starting with * is read as a YAML alias
  - name: vallency_lexicon
    data: ./valex_pml
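
As a sketch of how such a config might be consumed, the Python snippet below turns a `references` entry into the `path=target` form shown in the comments and expands a layer's `data` glob relative to `data_dir`. Both helper names are hypothetical, and treating `**` as a recursive wildcard is an assumption about the intended semantics.

```python
import glob
import os


def reference_spec(entry):
    """Turn a one- or two-element `references` entry into 'path=target'.

    A single-element entry means "no target", written as 'path=-'.
    """
    if len(entry) == 1:
        return f"{entry[0]}=-"
    if len(entry) == 2:
        path, target = entry
        return f"{path}={target}"
    raise ValueError(f"expected 1 or 2 items, got {len(entry)}: {entry!r}")


def layer_files(data_dir, pattern):
    """Expand a layer's `data` glob relative to data_dir.

    Assumption: `**` matches any number of nested directories.
    """
    return sorted(glob.glob(os.path.join(data_dir, pattern), recursive=True))


if __name__ == "__main__":
    print(reference_spec(["t-node/val_frame.rf"]))
    print(reference_spec(["t-a/aux.rf", "adata:a-node"]))
```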

choroba commented 9 years ago

If you want to skip btred, you'll have to reimplement Effective Parents and similar (IIRC).

m1ch4ls commented 9 years ago

Good point! We have to figure out how to handle user-defined relations.

Currently there are if statements in pml2base.btred for each custom relation. I don't like that very much, but I also don't mind living with that.

dan-zeman commented 9 years ago

I like the idea of skipping btred (especially because I was unable to run it). I process treebanks almost exclusively in Treex, so a Write::PMLTQ block in Treex would be my preferred solution. (Then there could be inherited blocks specialized for custom relations in particular treebanks; the parent block would just provide the default.)
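
The parent/inherited block idea can be sketched as follows. This is Python rather than Perl/Treex, all class and method names are hypothetical, and the "eparent" function is a toy stand-in that simply returns the direct parent; it is not the real Effective Parents computation.

```python
class BaseWriter:
    """Parent block: provides the default, with no user-defined relations."""

    def relations(self):
        # Maps relation name -> function(node) returning target node ids.
        return {}

    def relation_rows(self, nodes):
        """Collect (relation, source_id, target_id) rows for all nodes."""
        rows = []
        for name, compute in sorted(self.relations().items()):
            for node in nodes:
                for target in compute(node):
                    rows.append((name, node["id"], target))
        return rows


class ToyTreebankWriter(BaseWriter):
    """Inherited block: adds a custom relation for one treebank."""

    def relations(self):
        rels = dict(super().relations())
        # Toy placeholder: treats the direct parent as the effective parent.
        rels["eparent"] = lambda node: [node["parent"]] if node.get("parent") else []
        return rels


if __name__ == "__main__":
    nodes = [{"id": "n1", "parent": None}, {"id": "n2", "parent": "n1"}]
    print(BaseWriter().relation_rows(nodes))
    print(ToyTreebankWriter().relation_rows(nodes))
```

The per-relation if statements would then move out of the shared conversion code and into the treebank-specific subclass.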
