Closed arthurcgusmao closed 6 years ago
I'm afraid it's been years since I have touched this code, and I don't know the answer to your question. It's also a super busy time for me, as I'm in the middle of moving, so I don't have the bandwidth to dig into this for you. In the brief look I did just now, it looks like you'll want to write a new version of SplitCreator, though it's possible there's an easier way to go about this... Sorry I can't be of more help.
Hey Matt, no worries. Thanks for your always careful attention. I agree that the best option would be to change SplitCreator (and then open a pull request). However, because time is short for me too and I'm not very experienced with Scala, I wrote a Python function to create the splits. I hope this can be useful for more people in the future:
import os


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    if not os.path.exists(path):
        os.makedirs(path)


def create_split(dfs, splits_dirpath, split_name):
    """Creates a split directory that the PRA algorithm can use for the respective dataset.

    Arguments:
    - dfs: a dict whose keys are fold names (e.g. "train", "test") and values are DataFrames with
      head, tail, relation, and label columns.
    - splits_dirpath: path of the directory where the split should be created.
    - split_name: name of the split, created as a subdirectory of splits_dirpath.
    """
    this_split_path = os.path.join(splits_dirpath, split_name)
    ensure_dir(splits_dirpath)
    if not os.path.exists(this_split_path):
        os.makedirs(this_split_path)
    else:
        print('Split already exists: {}.'.format(this_split_path))
        return None
        # raise ValueError('Split {} already exists in {}.'.format(
        #     split_name, splits_dirpath))

    # get relations (dict.items() works on both Python 2 and 3)
    rels = set()
    for _, df in dfs.items():
        rels.update(df['relation'].unique())

    # create relations_to_run.tsv file, one relation per line
    with open(os.path.join(this_split_path, 'relations_to_run.tsv'), 'w') as f:
        for rel in rels:
            f.write('{}\n'.format(rel))

    # create each relation dir and one <fold_name>.tsv file per fold inside it
    for rel in rels:
        relpath = os.path.join(this_split_path, rel)
        ensure_dir(relpath)
        for fold_name, df in dfs.items():
            filtered = df.loc[df['relation'] == rel]
            filtered.to_csv(os.path.join(relpath, '{}.tsv'.format(fold_name)),
                            columns=['head', 'tail', 'label'], index=False, header=False, sep='\t')
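For anyone who wants to check what the function produced, here is a minimal sketch that only reads the split directory back. It assumes the layout written by create_split above (a relations_to_run.tsv file plus one subdirectory per relation). The name summarize_split is a hypothetical helper of mine, not something that exists in PRA.

```python
import os


def summarize_split(split_path):
    """Return {relation: [fold file names]} for a split directory.

    Assumes the layout written by create_split: a relations_to_run.tsv
    file at the top level and one subdirectory per relation.
    """
    with open(os.path.join(split_path, 'relations_to_run.tsv')) as f:
        relations = [line.strip() for line in f if line.strip()]

    summary = {}
    for rel in relations:
        rel_dir = os.path.join(split_path, rel)
        if os.path.isdir(rel_dir):
            summary[rel] = sorted(os.listdir(rel_dir))
        else:
            # Listed in relations_to_run.tsv but missing on disk:
            # report an empty list instead of raising.
            summary[rel] = []
    return summary
```

A relation that maps to an empty list, or a relation missing one of the expected fold files, is a hint that the input DataFrames did not contain what was expected.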
Hi,
I want to automatically generate both the graph and the split from files (train.tsv, valid.tsv, test.tsv) that already have negative examples. For instance, each of the files would be in the following format:
I have seen that it is possible to generate the graph from relation sets that contain only positive triples, and that we can generate a split (with a proportion of automatically generated negative examples) from the graph.
What I am asking is whether, with the current implementation, we can automatically create a split (the directory and the files) using negative examples that I specify myself. If so, how can it be done?
PS: If you feel this discussion should be part of another one instead of having its own topic, please let me know and I'll move it there.