Closed cabisarri closed 2 years ago
Hi @cabisarri, thanks for your interest in our work.
The train data follows the same format as the dev data. Since preprocessing takes a lot of time and users tend to adopt UniSAr directly for inference, the current code only preprocesses the dev set. However, you could simply modify the following lines to support the train set, then retrain the model yourself.
https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step1_schema_linking.py#L432
https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step2_serialization.py#L167
Then change the fairseq-preprocess function here https://github.com/microsoft/ContextualSP/blob/ad7d7979957207e5fe23db7db1cad1066665b66b/unified_parser_text_to_sql/step2_serialization.py#L9 as follows:
    cmd = f"python -m multiprocessing_bpe_encoder \
            --encoder-json ./BART-large/encoder.json \
            --vocab-bpe ./BART-large/vocab.bpe \
            --inputs {generate_path}/train.src \
            --outputs {generate_path}/train.bpe.src \
            --workers 1 \
            --keep-empty"
    run_command(cmd)

    cmd = f"python -m multiprocessing_bpe_encoder \
            --encoder-json ./BART-large/encoder.json \
            --vocab-bpe ./BART-large/vocab.bpe \
            --inputs {generate_path}/train.tgt \
            --outputs {generate_path}/train.bpe.tgt \
            --workers 1 \
            --keep-empty"
    run_command(cmd)

    cmd = f"python -m multiprocessing_bpe_encoder \
            --encoder-json ./BART-large/encoder.json \
            --vocab-bpe ./BART-large/vocab.bpe \
            --inputs {generate_path}/dev.src \
            --outputs {generate_path}/dev.bpe.src \
            --workers 1 \
            --keep-empty"
    run_command(cmd)

    cmd = f"python -m multiprocessing_bpe_encoder \
            --encoder-json ./BART-large/encoder.json \
            --vocab-bpe ./BART-large/vocab.bpe \
            --inputs {generate_path}/dev.tgt \
            --outputs {generate_path}/dev.bpe.tgt \
            --workers 1 \
            --keep-empty"
    run_command(cmd)

    cmd = f'fairseq-preprocess --source-lang "src" --target-lang "tgt" \
            --trainpref {generate_path}/train.bpe \
            --validpref {generate_path}/dev.bpe \
            --destdir {generate_path}/bin \
            --workers 2 \
            --srcdict ./BART-large/dict.src.txt \
            --tgtdict ./BART-large/dict.tgt.txt '
    subprocess.Popen(
        cmd, universal_newlines=True, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
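The four BPE-encoding commands differ only in the split (train/dev) and the side (src/tgt), so they could also be generated in a loop. The following is only a rough sketch of that idea, not code from the repository: the helper names (`bpe_encode_args`, `preprocess_args`, `all_commands`, `run_all`) are hypothetical, and it assumes the same `./BART-large` directory and file layout as the snippet above. Commands are built as argument lists so they can be inspected before anything is executed.

```python
import subprocess
from pathlib import Path

# Assumption: BART files live in the same relative directory as in the thread.
BART_DIR = Path("./BART-large")


def bpe_encode_args(generate_path, split, side, workers=1):
    """Arguments for BPE-encoding one file, e.g. train.src -> train.bpe.src."""
    base = Path(generate_path)
    return [
        "python", "-m", "multiprocessing_bpe_encoder",
        "--encoder-json", str(BART_DIR / "encoder.json"),
        "--vocab-bpe", str(BART_DIR / "vocab.bpe"),
        "--inputs", str(base / f"{split}.{side}"),
        "--outputs", str(base / f"{split}.bpe.{side}"),
        "--workers", str(workers),
        "--keep-empty",
    ]


def preprocess_args(generate_path, workers=2):
    """Arguments for binarizing the BPE-encoded train and dev files."""
    base = Path(generate_path)
    return [
        "fairseq-preprocess",
        "--source-lang", "src", "--target-lang", "tgt",
        "--trainpref", str(base / "train.bpe"),
        "--validpref", str(base / "dev.bpe"),
        "--destdir", str(base / "bin"),
        "--workers", str(workers),
        "--srcdict", str(BART_DIR / "dict.src.txt"),
        "--tgtdict", str(BART_DIR / "dict.tgt.txt"),
    ]


def all_commands(generate_path):
    """Four BPE-encoding commands (train/dev x src/tgt), then fairseq-preprocess."""
    cmds = [
        bpe_encode_args(generate_path, split, side)
        for split in ("train", "dev")
        for side in ("src", "tgt")
    ]
    cmds.append(preprocess_args(generate_path))
    return cmds


def run_all(generate_path):
    """Run every command in order, stopping at the first failure."""
    for args in all_commands(generate_path):
        subprocess.run(args, check=True)
```

Calling `all_commands("some/output/dir")` only builds the commands, which makes it easy to print and compare them with the hand-written version before calling `run_all`. Passing argument lists to `subprocess.run` also avoids `shell=True`.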
Thanks a lot for the details. I will give it a try.
Closed since there is no more activity.
Hi @DreamerDeo, could you help on this?