guxd / deep-code-search

DeepCS: Deep Code Search
MIT License

Convert raw txt data to h5 #50

Open hellokitty753159 opened 4 years ago

hellokitty753159 commented 4 years ago

[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]

Each of my data records is a multi-level nested list like the one above. I need to convert it to h5 format so that it can be used directly with your program, but np.array cannot handle this.

import numpy
import tables

def save_hdf5(vecs, filename):
    '''save the processed data into a hdf5 file'''
    # Index is a table description defined elsewhere (not shown in this post).
    f = tables.open_file(filename, 'w')
    filters = tables.Filters(complib='blosc', complevel=5)
    earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(), shape=(0,), filters=filters)
    indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
    pos = 0
    line = 1
    for x in vecs:
        print(line)
        earrays.append(numpy.array(x))
        ind = indices.row
        ind['pos'] = pos
        ind['length'] = len(x)
        ind.append()
        pos += len(x)
        line = line + 1
    f.close()

How should I modify this code? Thanks.

guxd commented 4 years ago

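The existing code only supports flat (linear) sequences and does not support nesting. You can treat each square bracket as a token as well, so that the nested list becomes a single flat sequence.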
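To illustrate the suggestion above, here is a minimal sketch that flattens a nested list into one sequence by emitting a token for every opening and closing bracket. The function names and the bracket ids (30000 and 30001) are assumptions made for this example, not part of DeepCS. The ids only need to be outside the real vocabulary and, if the data is stored with Int16Atom as in the question, no larger than 32767.

```python
# Hypothetical ids reserved for "[" and "]"; pick values your vocabulary does not use.
OPEN_ID = 30000
CLOSE_ID = 30001


def flatten_with_brackets(node, open_id=OPEN_ID, close_id=CLOSE_ID):
    """Turn a nested list of ints into a flat list, keeping brackets as tokens."""
    if not isinstance(node, list):
        return [node]
    out = [open_id]
    for child in node:
        out.extend(flatten_with_brackets(child, open_id, close_id))
    out.append(close_id)
    return out


def unflatten(tokens, open_id=OPEN_ID, close_id=CLOSE_ID):
    """Rebuild the nested list from a flat sequence made by flatten_with_brackets."""
    stack = [[]]
    for tok in tokens:
        if tok == open_id:
            stack.append([])
        elif tok == close_id:
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    return stack[0][0]


nested = [20, [8, [73]], [4]]
flat = flatten_with_brackets(nested)
restored = unflatten(flat)
```

Each flattened record is an ordinary list of integers, so it can be passed to a function such as save_hdf5 from the question without further changes. Whether the downstream model learns anything useful from the bracket tokens is a separate question that this sketch does not address.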

li-car-fei commented 2 years ago

How do I convert a txt file into the .h5 format used in your dataset?

guxd commented 2 years ago

@li-car-fei You can refer to the binarize function in https://github.com/guxd/DialogBERT/blob/master/prepare_data.py, which converts dialogs (a list of sequences) into an earray.
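As a rough picture of the storage layout involved, the sketch below packs a list of token-id sequences into one flat array plus an index of (pos, length) pairs, which mirrors the pos and length fields used by save_hdf5 earlier in this thread. It uses plain Python lists as stand-ins for the earray and the index table, and the function names are made up for this example. It is not the binarize function itself.

```python
def pack_sequences(seqs):
    """Concatenate sequences into one flat list and record where each one starts."""
    flat = []
    index = []
    pos = 0
    for seq in seqs:
        flat.extend(seq)
        index.append((pos, len(seq)))  # same role as ind['pos'] and ind['length']
        pos += len(seq)
    return flat, index


def read_sequence(flat, index, i):
    """Recover the i-th sequence from the flat list using its index entry."""
    pos, length = index[i]
    return flat[pos:pos + length]


flat, index = pack_sequences([[1, 2, 3], [4], [5, 6]])
third = read_sequence(flat, index, 2)
```

With PyTables, the flat list would correspond to the extendable array and the index to the table of positions and lengths, so converting a txt file mostly comes down to tokenizing each line into integer ids before packing.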

Ashbajawed commented 1 year ago

@li-car-fei You can refer to the binarize function in https://github.com/guxd/DialogBERT/blob/master/prepare_data.py to convert the dialog (a list of sequences) into an earray.

Hey, sorry for the basic question, but could you explain what dialogs is in def binarize(dialogs, tokenizer, output_path)?

Is it an array of sentences, or something else?

guxd commented 1 year ago

You can refer to the following function, which builds the dialogs argument that is passed to binarize.

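import numpy as np

def get_daily_dial_data(data_path):
    """Read one dialog per line and return a list of dialog dicts."""
    dialogs = []
    # Close the file automatically once all lines are read.
    with open(data_path, 'r') as f:
        dials = f.readlines()
    for dial in dials:
        utts = []
        # Utterances within a line are separated by ' __eou__ '.
        for i, utt in enumerate(dial.split(' __eou__ ')):
            # Speakers alternate: even turns are 'A', odd turns are 'B'.
            caller = 'A' if i % 2 == 0 else 'B'
            # Each utterance is a (caller, text, placeholder array) tuple.
            utts.append((caller, utt, np.zeros((1, 1))))
        dialog = {'knowledge': '', 'utts': utts}
        dialogs.append(dialog)
    return dialogs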

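According to this code, dialogs is a list of dialog dictionaries. Each dialog has a 'knowledge' string and a 'utts' list, and each entry in utts is a (caller, utterance, array) tuple, where the utterance is one sentence of the conversation.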