Open hellokitty753159 opened 4 years ago
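The existing code only supports linear (flat) sequences, not nested ones. You can treat your square brackets as tokens as well, so the whole structure can be handled as a single sequence.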
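To make the bracket-as-token idea concrete, here is a minimal sketch in plain Python. It is not code from the DialogBERT repository: the names `linearize` and `delinearize` and the two bracket IDs are hypothetical, and the IDs must be replaced with values that are unused in your own vocabulary.

```python
# Hypothetical IDs reserved for "[" and "]"; choose values that do not
# collide with any real token ID in your vocabulary.
OPEN_ID = 30000
CLOSE_ID = 30001


def linearize(tree, open_id=OPEN_ID, close_id=CLOSE_ID):
    """Flatten a nested list of ints into one flat list, keeping brackets as tokens."""
    if isinstance(tree, int):
        return [tree]
    flat = [open_id]
    for child in tree:
        flat.extend(linearize(child, open_id, close_id))
    flat.append(close_id)
    return flat


def delinearize(flat, open_id=OPEN_ID, close_id=CLOSE_ID):
    """Inverse of linearize: rebuild the nested list from the flat sequence."""
    stack = [[]]
    for tok in flat:
        if tok == open_id:
            stack.append([])          # start a new nested list
        elif tok == close_id:
            done = stack.pop()        # close the current list
            stack[-1].append(done)    # and attach it to its parent
        else:
            stack[-1].append(tok)
    return stack[0][0]
```

Each nested item then becomes an ordinary flat list of integers, which can be stored like any other token sequence, and `delinearize` recovers the original nesting if it is needed later.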
How do I convert a txt file into the .h5 format used by your dataset?
@li-car-fei You can refer to the binarize function in https://github.com/guxd/DialogBERT/blob/master/prepare_data.py to convert a dialog (a list of sequences) into an earray.
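As a rough illustration of the storage layout, the save_hdf5 function quoted in this thread appends every sequence to one flat extendable array and records a (pos, length) row per sequence. The sketch below mimics that layout with plain Python lists; it does not call PyTables, and the names `pack` and `unpack` are hypothetical.

```python
def pack(sequences):
    """Concatenate sequences into one flat list plus (pos, length) index rows."""
    flat, index = [], []
    pos = 0
    for seq in sequences:
        flat.extend(seq)
        index.append({"pos": pos, "length": len(seq)})
        pos += len(seq)
    return flat, index


def unpack(flat, index, i):
    """Recover the i-th sequence from the flat storage."""
    row = index[i]
    return flat[row["pos"]: row["pos"] + row["length"]]


# Two made-up token sequences, for illustration only.
flat, index = pack([[101, 7, 102], [101, 8, 9, 102]])
second = unpack(flat, index, 1)
```

Because the array is one-dimensional, every stored item has to be a flat sequence of integers, which is why nested lists need to be linearized first.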
Hey, sorry for being dumb, but can you please explain what dialogs is in
def binarize(dialogs, tokenizer, output_path)
Is it an array of sentences, or something else?
You can refer to this function, which builds the dialogs argument passed to the binarize function.
    import numpy as np

    def get_daily_dial_data(data_path):
        dialogs = []
        # One dialog per line; utterances are separated by ' __eou__ '.
        dials = open(data_path, 'r').readlines()
        for dial in dials:
            utts = []
            for i, utt in enumerate(dial.rsplit(' __eou__ ')):
                # Speakers alternate: even turns are 'A', odd turns are 'B'.
                caller = 'A' if i % 2 == 0 else 'B'
                utts.append((caller, utt, np.zeros((1, 1))))
            dialog = {'knowledge': '', 'utts': utts}
            dialogs.append(dialog)
        return dialogs
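According to this code, dialogs is a list of dialog entries, and each dialog is a dictionary with a 'knowledge' string and a 'utts' list. The utts list holds one tuple per utterance: the caller ('A' or 'B'), the utterance text, and a placeholder array.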
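For a concrete picture, here is a small hand-written value with the same shape, plus a helper that pulls out the utterance texts. The utterances are made up, the helper name `utterance_texts` is hypothetical, and a plain nested list stands in for the `np.zeros((1, 1))` placeholder so that the sketch needs no imports.

```python
# One dialog with three turns; [[0.0]] stands in for np.zeros((1, 1)).
dialogs = [
    {
        "knowledge": "",
        "utts": [
            ("A", "Hi there .", [[0.0]]),
            ("B", "Hello !", [[0.0]]),
            ("A", "How are you ?", [[0.0]]),
        ],
    },
]


def utterance_texts(dialogs):
    """Return, for each dialog, the list of its utterance strings."""
    return [[utt for _caller, utt, _extra in dialog["utts"]] for dialog in dialogs]


texts = utterance_texts(dialogs)
```

So each dialog boils down to a list of sentences, with the speaker label and the placeholder carried alongside each one.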
[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]
Each of my data items is a multi-level nested list like the one above. I need to convert it to h5 format so that it can be used directly with your program, but np.array cannot handle this.
    import numpy
    import tables

    # Index is the table description for the (pos, length) rows; it is defined elsewhere and not shown here.
    def save_hdf5(vecs, filename):
        '''save the processed data into a hdf5 file'''
        f = tables.open_file(filename, 'w')
        filters = tables.Filters(complib='blosc', complevel=5)
        earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(), shape=(0,), filters=filters)
        indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
        pos = 0
        line = 1
        for x in vecs:
            print(line)
            earrays.append(numpy.array(x))
            ind = indices.row
            ind['pos'] = pos
            ind['length'] = len(x)
            ind.append()
            pos += len(x)
            line = line + 1
        f.close()
How should I modify this code? Thanks.