matt-gardner / pra


Create split with user provided negative examples #25

Closed arthurcgusmao closed 6 years ago

arthurcgusmao commented 6 years ago

Hi,

I want to automatically generate both the graph and the split from files (train.tsv, valid.tsv, test.tsv) that already have negative examples. For instance, each of the files would be in the following format:

Alice   Loves   Bob     1
Alice   Loves   Carl    -1
...     ...     ...     {1|-1}

I have seen that it is possible to generate the graph from relation sets that contain only positive triples, and that we can generate a split (with a proportion of automatically generated negative examples) from the graph.

My question is: with the current implementation, can I automatically create a split (the directory and its files) using negative examples that I specify myself? If so, how?

PS: If you feel this discussion belongs in an existing topic rather than its own, please let me know and I'll move it there.

matt-gardner commented 6 years ago

I'm afraid it's been years since I have touched this code, and I don't know the answer to your question. It's also a super busy time for me, as I'm in the middle of moving, so I don't have the bandwidth to dig into this for you. In the brief look I did just now, it looks like you'll want to write a new version of SplitCreator, though it's possible there's an easier way to go about this... Sorry I can't be of more help.

arthurcgusmao commented 6 years ago

Hey Matt, no worries. Thanks for your always careful attention. I agree that the best approach would be to change SplitCreator (and then open a pull request). However, because time is short for me too and I'm not very experienced with Scala, I wrote a Python function to create the splits. I hope this can be useful to more people in the future:

import os


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    if not os.path.exists(path):
        os.makedirs(path)

def create_split(dfs, splits_dirpath, split_name):
    """Creates a split directory that the PRA algorithm can use for the respective dataset.

    Arguments:
    - dfs: a dict whose keys are fold names (e.g. "train", "test") and whose values are
    DataFrames with head, tail, relation, and label columns.
    - splits_dirpath: path of the directory that holds all splits.
    - split_name: name of the split; it is created as a subdirectory of splits_dirpath.
    """
    this_split_path = splits_dirpath + '/' + split_name
    ensure_dir(splits_dirpath)
    if not os.path.exists(this_split_path):
        os.makedirs(this_split_path)
    else:
        print('Split already exists: {}.'.format(this_split_path))
        return None
        # raise ValueError('Split {} already exists in {}.'.format(
        #         split_name, splits_dirpath))

    # get relations
    rels = set()
    for df in dfs.values():
        rels.update(df['relation'].unique())

    # create relations_to_run.tsv file
    with open(this_split_path + '/relations_to_run.tsv', 'w') as f:
        for rel in rels:
            f.write('{}\n'.format(rel))

    # create each relation dir and its files
    for rel in rels:
        relpath = '{}/{}'.format(this_split_path, rel)
        ensure_dir(relpath)
        # dict.items() works on both Python 2 and 3 (iteritems() is Python 2 only)
        for fold_name, df in dfs.items():
            filtered = df.loc[df['relation'] == rel]
            # one example per line: head, tail, label (tab-separated, no header)
            filtered.to_csv('{}/{}.tsv'.format(relpath, fold_name),
                            columns=['head', 'tail', 'label'], index=False, header=False, sep='\t')
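
For anyone who would rather skip pandas, here is a rough standard-library sketch that goes straight from the files described in the question to the same directory layout the function above produces. It is an illustration under assumptions, not something checked against PRA itself: the input files are assumed to be tab-separated with columns in head, relation, tail, label order (as in the question's example), the helper names read_labeled_triples and write_split are made up for this sketch, and it targets Python 3.

```python
import csv
import os
from collections import defaultdict


def read_labeled_triples(path):
    """Read (head, relation, tail, label) rows from a tab-separated file."""
    with open(path, newline='') as f:
        # Skip blank lines; every remaining row is assumed to have 4 columns.
        return [tuple(row) for row in csv.reader(f, delimiter='\t') if row]


def write_split(fold_files, split_path):
    """Write <split_path>/<relation>/<fold>.tsv for every relation and fold.

    fold_files maps a fold name (e.g. "train") to the path of its input file.
    Returns the sorted list of relations found in the input files.
    """
    # relation -> fold name -> list of (head, tail, label) rows
    grouped = defaultdict(lambda: defaultdict(list))
    for fold_name, path in fold_files.items():
        for head, relation, tail, label in read_labeled_triples(path):
            grouped[relation][fold_name].append((head, tail, label))

    os.makedirs(split_path, exist_ok=True)
    relations = sorted(grouped)

    # One relation name per line, as in the function above.
    with open(os.path.join(split_path, 'relations_to_run.tsv'), 'w') as f:
        for relation in relations:
            f.write(relation + '\n')

    for relation in relations:
        rel_dir = os.path.join(split_path, relation)
        os.makedirs(rel_dir, exist_ok=True)
        # A file is written for every fold, even when it has no rows for
        # this relation, which matches what the pandas version does.
        for fold_name in fold_files:
            out_path = os.path.join(rel_dir, fold_name + '.tsv')
            with open(out_path, 'w', newline='') as f:
                writer = csv.writer(f, delimiter='\t', lineterminator='\n')
                writer.writerows(grouped[relation][fold_name])
    return relations
```

Calling write_split({'train': 'train.tsv', 'valid': 'valid.tsv', 'test': 'test.tsv'}, 'splits/my_split') would then create one directory per relation under splits/my_split (a placeholder path), each holding train.tsv, valid.tsv and test.tsv with head, tail and label columns. Unlike create_split above, this sketch does not refuse to overwrite an existing split.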