sheng-z / stog

AMR Parsing as Sequence-to-Graph Transduction
MIT License

How to generate the "amr_2.0_utils" on my own data #3

Closed bcmi220 closed 3 years ago

bcmi220 commented 5 years ago

Hi Sheng,

Thanks for your nice work. Can you offer the scripts or methods to generate the "amr_2.0_utils" on the other dataset? Thank you very much!

sheng-z commented 5 years ago

Hi,

Yes, I plan to clean up and release those scripts, but I'm too busy to do it now, will probably get it done later this year.

Wangpeiyi9979 commented 4 years ago

Hi, may I ask whether this issue has been solved?

bjascob commented 4 years ago

Here's some code to generate amr_utils from your training set. Note that you will need to download a few things for this to work (see comments in the code). I can verify this runs and produces the needed files, but I'm not certain it is compatible with different versions of the downloaded files.

#!/usr/bin/python3
import os
from   types import SimpleNamespace
from   stog.data.dataset_readers.amr_parsing.preprocess.recategorizer import Recategorizer
from   stog.data.dataset_readers.amr_parsing.node_utils import NodeUtilities

# amr_utils can be downloaded from https://www.cs.jhu.edu/~s.zhang/data/AMR/amr_2.0_utils.tar.gz
# Inside the tar.gz are 2 files (joints.txt, text_anonymization_rules.json) that I'm not sure where
# they originate from, but they are used in STOG
if __name__ == '__main__':
    util_dir     = 'data/amr_utils'
    data_dir     = 'data/LDCProcess'
    train_data = os.path.join(data_dir, 'train.txt.features')
    propbank_dir = 'data/AMR-Downloads/propbank-frames-2018-04-20/'
    verbal_fn    = 'data/AMR-Downloads/verbalization-list-v1.06.txt'

    assert os.path.exists(train_data), 'You must annotate the LDC training data before running this script'

    # Create:
    #   entity_type_cooccur_counter.json, name_op_cooccur_counter.json, name_type_cooccur_counter.json,
    #   wiki_span_cooccur_counter.json
    recategorizer = Recategorizer(train_data, build_utils=True, util_dir=util_dir)

    # Creates: lemma_frame_counter.json, frame_lemma_counter.json, senseless_node_counter.json
    # Need to download verbalization-list-v1.06.txt from https://amr.isi.edu/download.html
    #   wget https://amr.isi.edu/download/lists/verbalization-list-v1.06.txt
# Also download the propbank frames from https://github.com/propbank/propbank-frames/
# The 2018-04-20 release is the latest as of 2020/05/24.  Originally the directory was tagged
    # at propbank-frames-xml-2016-03-08.  There are a few releases on the site that might be earlier
    args = SimpleNamespace()
    args.amr_train_files        = [train_data]
    args.propbank_dir           = propbank_dir
    args.verbalization_file     = verbal_fn
    args.dump_dir               = util_dir
    args.train_file_base_freq   = 1.0
    args.propbank_base_freq     = 1.0
    args.propbank_bonus         = 10.0
    args.verbalization_base_freq= 1.0
    args.verbalize_freq         = 100.0
    args.maybe_verbalize_freq   = 100.0
    args.verbalize_bonus        = 10.0
    nu = NodeUtilities.from_raw(
        args.amr_train_files,
        args.propbank_dir,
        args.verbalization_file,
        args.dump_dir,
        args.train_file_base_freq,
        args.propbank_base_freq,
        args.propbank_bonus,
        args.verbalization_base_freq,
        args.verbalize_freq,
        args.maybe_verbalize_freq,
        args.verbalize_bonus)
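As a follow-up sanity check (not part of the original script), one can verify that the expected JSON files were actually produced in `util_dir`. The file names below are taken from the comments in the script above; the helper itself is hypothetical:

```python
#!/usr/bin/python3
import os

def check_amr_utils(util_dir):
    """Return the list of expected util files missing from util_dir."""
    expected = [
        'entity_type_cooccur_counter.json',
        'name_op_cooccur_counter.json',
        'name_type_cooccur_counter.json',
        'wiki_span_cooccur_counter.json',
        'lemma_frame_counter.json',
        'frame_lemma_counter.json',
        'senseless_node_counter.json',
    ]
    return [fn for fn in expected
            if not os.path.exists(os.path.join(util_dir, fn))]

if __name__ == '__main__':
    missing = check_amr_utils('data/amr_utils')
    if missing:
        print('Missing util files:', ', '.join(missing))
    else:
        print('All expected util files found.')
```

Note that joints.txt and text_anonymization_rules.json are deliberately excluded from the check, since they are not generated by the script and must come from the downloaded amr_2.0_utils.tar.gz.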
guopeiming commented 3 years ago

> amr_utils can be downloaded from https://www.cs.jhu.edu/~s.zhang/data/AMR/amr_2.0_utils.tar.gz Inside the tar.gz are 2 files (joints.txt, text_anonymization_rules.json) that I'm not sure where they originate from but are used in STOG

Hi, @bjascob I think I found the origin of joints.txt: it comes from https://github.com/ChunchuanLv/AMR_AS_GRAPH_PREDICTION/blob/master/data/joints.txt. Do you know the origin of text_anonymization_rules.json, or the script to generate it?

bjascob commented 3 years ago

Sorry, no idea. You'll have to ask the author.

Wangpeiyi9979 commented 3 years ago

Hi, I found that text_anonymization_rules.json is related to the anonymization process. It seems that the text_anonymization_rules.json in amr_2.0_utils only contains entities from AMR 2.0.
Could you share the script to generate it for my own datasets? Thanks!
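For reference, here is a minimal sketch of what span anonymization generally looks like. The rule format below is hypothetical and is NOT the actual schema of text_anonymization_rules.json; it only illustrates the idea of replacing matched text spans with indexed placeholders while keeping a mapping for later de-anonymization:

```python
import re

# Hypothetical rules: each entry maps a regex over the input text to a
# placeholder type.  The real text_anonymization_rules.json may differ.
RULES = [
    (re.compile(r'\b\d{4}-\d{2}-\d{2}\b'), 'DATE'),
    (re.compile(r'\b\d+(\.\d+)?\b'), 'NUMBER'),
]

def anonymize(text):
    """Replace matched spans with indexed placeholders (e.g. DATE_0) and
    return the anonymized text plus the placeholder-to-span mapping."""
    mapping = {}
    counters = {}
    for pattern, etype in RULES:
        def repl(m, etype=etype):
            idx = counters.get(etype, 0)
            counters[etype] = idx + 1
            placeholder = f'{etype}_{idx}'
            mapping[placeholder] = m.group(0)
            return placeholder
        text = pattern.sub(repl, text)
    return text, mapping
```

For example, `anonymize('Born on 1999-01-02 with 3 cats')` yields `'Born on DATE_0 with NUMBER_0 cats'` together with the mapping back to the original spans.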