microsoft / ContextualSP

Multiple paper open-source codes of the Microsoft Research Asia DKI group
MIT License
374 stars 62 forks

Could you share the post-processing script or the post-processed train data for UniSar? #32

Closed cabisarri closed 2 years ago

SivilTaram commented 2 years ago

Hi @DreamerDeo, could you help on this?

longxudou commented 2 years ago

Hi @cabisarri, thanks for your interest in our work.

The train data follows the same format as the dev data. Since preprocessing takes a long time and most users only run UniSAR for inference, the current code only preprocesses the dev set. However, you can simply modify the following lines to support the train set and then retrain the model yourself.

https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step1_schema_linking.py#L432 https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step2_serialization.py#L167

Change the fairseq-preprocess step here https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step2_serialization.py#L9 as follows:

import subprocess

# run_command is the shell helper already used in step2_serialization.py
cmd = f"python -m multiprocessing_bpe_encoder \
          --encoder-json ./BART-large/encoder.json \
          --vocab-bpe ./BART-large/vocab.bpe \
          --inputs {generate_path}/train.src \
          --outputs {generate_path}/train.bpe.src \
          --workers 1 \
          --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/train.tgt \
        --outputs {generate_path}/train.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.src \
        --outputs {generate_path}/dev.bpe.src \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f"python -m multiprocessing_bpe_encoder \
        --encoder-json ./BART-large/encoder.json \
        --vocab-bpe ./BART-large/vocab.bpe \
        --inputs {generate_path}/dev.tgt \
        --outputs {generate_path}/dev.bpe.tgt \
        --workers 1 \
        --keep-empty"
run_command(cmd)

cmd = f'fairseq-preprocess --source-lang "src" --target-lang "tgt" \
    --trainpref {generate_path}/train.bpe \
    --validpref {generate_path}/dev.bpe \
    --destdir {generate_path}/bin \
    --workers 2 \
    --srcdict ./BART-large/dict.src.txt \
    --tgtdict ./BART-large/dict.tgt.txt '

subprocess.Popen(
    cmd, universal_newlines=True, shell=True,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
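The snippet above relies on a `run_command` helper defined in the repo's scripts. If you adapt the snippet outside the repo, a minimal stand-in (my sketch, not the project's exact implementation) could look like this:

```python
import subprocess

def run_command(cmd: str) -> str:
    """Run a shell command and return its stdout; stderr is captured as well."""
    out, err = subprocess.Popen(
        cmd, universal_newlines=True, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    ).communicate()
    return out
```

This mirrors the `subprocess.Popen(...).communicate()` pattern the snippet already uses for the final fairseq-preprocess call.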
cabisarri commented 2 years ago

Thanks a lot for the details. I will give it a try.

SivilTaram commented 2 years ago

Closing since there is no further activity.