guxd / deep-code-search

DeepCS: Deep Code Search
MIT License

Transform raw text data to h5 #50

Open hellokitty753159 opened 4 years ago

hellokitty753159 commented 4 years ago

[20, [8, [14, [73]], [14, [36]], [4, [28]]], [4, [1516], [660]], [19, [15, [11, [8, [4, [169], [66], [4]]], [4, [4]]]], [15, [11, [8, [4, [4, [6599]], [9, [7, [4]]]]], [4, [160]]]], [15, [11, [8, [4, [1534], [74], [1216]]], [4, [1216], [74]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [1534]]]], [15, [11, [8, [4, [6057], [8]]], [4, [8], [74]]]], [15, [11, [8, [4, [1516], [196], [909]]], [4, [59]]]]], [12, [13]]]

Each of my data records is a multi-level nested list like the one above. I need to convert the data to h5 format so that it can be used directly with your program, but np.array cannot handle this kind of input.

    def save_hdf5(vecs, filename):
        '''save the processed data into a hdf5 file'''
        f = tables.open_file(filename, 'w')
        filters = tables.Filters(complib='blosc', complevel=5)
        earrays = f.create_earray(f.root, 'phrases', tables.Int16Atom(),shape=(0,),filters=filters)
        indices = f.create_table("/", 'indices', Index, "a table of indices and lengths")
        pos = 0
        line=1
        for x in vecs:
            print(line)
            earrays.append(numpy.array(x))
            ind = indices.row
            ind['pos'] = pos
            ind['length'] = len(x)
            ind.append()
            pos += len(x)
            line=line+1
        f.close()

How should I modify this code? Thanks.
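As a sketch of one way to approach the question above (not an answer from the maintainers): save_hdf5 appends each record to a one-dimensional earray, so a nested record would first have to be turned into a flat list of integers. The two helpers below are hypothetical. flatten discards the nesting, while linearize keeps it by emitting bracket markers; the marker values -1 and -2 are assumptions and must not collide with real token ids. The sample is a shortened version of the record in the question.

```python
def flatten(tree):
    """Depth-first flatten of an arbitrarily nested list of ints."""
    flat = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            # Push children in reverse so they are popped left to right.
            stack.extend(reversed(node))
        else:
            flat.append(node)
    return flat


def linearize(tree, open_id=-1, close_id=-2):
    """Flatten while keeping the nesting as explicit bracket markers.

    open_id and close_id are hypothetical marker values, not part of DeepCS.
    """
    if not isinstance(tree, list):
        return [tree]
    out = [open_id]
    for child in tree:
        out.extend(linearize(child, open_id, close_id))
    out.append(close_id)
    return out


sample = [20, [8, [14, [73]], [4, [28]]], [12, [13]]]
print(flatten(sample))    # [20, 8, 14, 73, 4, 28, 12, 13]
print(linearize([1, [2]]))  # [-1, 1, -1, 2, -2, -2]
```

Either result is a flat list, so it could be passed to numpy.array(x) inside save_hdf5 in place of the nested record. Whether a model trained on flat token sequences can make use of the bracket markers is a separate question that this sketch does not address.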

guxd commented 4 years ago


li-car-fei commented 2 years ago


guxd commented 2 years ago

@li-car-fei You can refer to the binarize function in [link missing in source], which converts dialogs (a list of sequences) into an earray.
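To illustrate what storing a list of sequences in a single array can look like, here is a minimal pure-Python sketch of the layout used by save_hdf5 earlier in this thread: one flat array plus a table of (pos, length) rows. The names pack and unpack are hypothetical, and this thread does not confirm that binarize uses exactly this layout.

```python
def pack(sequences):
    """Concatenate sequences and record where each one starts."""
    phrases = []   # stand-in for the 'phrases' earray
    indices = []   # stand-in for the 'indices' table: (pos, length) rows
    pos = 0
    for seq in sequences:
        phrases.extend(seq)
        indices.append((pos, len(seq)))
        pos += len(seq)
    return phrases, indices


def unpack(phrases, indices, i):
    """Recover the i-th sequence from the flat array."""
    pos, length = indices[i]
    return phrases[pos:pos + length]


sequences = [[4, 169, 66], [1516, 660], [13]]
phrases, indices = pack(sequences)
print(indices)                      # [(0, 3), (3, 2), (5, 1)]
print(unpack(phrases, indices, 1))  # [1516, 660]
```

Because every sequence is stored back to back, variable-length sequences need no padding; the index table alone is enough to slice any one of them out again.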

Ashbajawed commented 1 year ago

@li-car-fei You can refer to the binarize function in to convert the dialog (a list of sequences) into an earray.

Hey, sorry for being dumb, but could you please explain what dialogs is in def binarize(dialogs, tokenizer, output_path)?

Is it an array of sentences, or something else?

guxd commented 1 year ago

You can refer to this function, which builds the dialogs argument that is passed to the binarize function.

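import numpy as np

def get_daily_dial_data(data_path):
    dialogs = []
    # Each line of the file holds one dialog.
    with open(data_path, 'r') as f:
        dials = f.readlines()
    for dial in dials:
        utts = []
        # Utterances are separated by ' __eou__ '; speakers alternate A, B.
        for i, utt in enumerate(dial.rsplit(' __eou__ ')):
            caller = 'A' if i % 2 == 0 else 'B'
            utts.append((caller, utt, np.zeros((1, 1))))
        dialog = {'knowledge': '', 'utts': utts}
        dialogs.append(dialog)  # collect each dialog in the result list
    return dialogs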

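According to this code, dialogs is a list of dialog entries. Each dialog is a dictionary with a 'knowledge' string (empty here) and a 'utts' list. Each element of utts is a (caller, utterance, array) tuple, with one tuple per sentence in the dialog.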
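To make that structure concrete, here is a small self-contained sketch that builds dialogs from a single made-up line instead of a file. The helper parse_dialog and the sample text are hypothetical, and a plain nested list [[0.0]] stands in for np.zeros((1, 1)) so that the snippet runs without numpy.

```python
def parse_dialog(line):
    """Split one line into (caller, utterance, placeholder) tuples."""
    utts = []
    for i, utt in enumerate(line.split(' __eou__ ')):
        caller = 'A' if i % 2 == 0 else 'B'
        # [[0.0]] is a plain-list stand-in for np.zeros((1, 1)).
        utts.append((caller, utt, [[0.0]]))
    return {'knowledge': '', 'utts': utts}


# Made-up sample line: three utterances separated by ' __eou__ '.
line = "Hi , how are you ? __eou__ Fine , thanks . __eou__ Good to hear ."
dialogs = [parse_dialog(line)]

for caller, utt, _ in dialogs[0]['utts']:
    print(caller, '|', utt)
```

Here dialogs has one entry, and its 'utts' list holds three tuples whose callers alternate A, B, A.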