Open devhimd19 opened 3 years ago
Thank you for your report! The error means the n-gram "WNA" was never trained, because the corpus (the one trained on UniProt) does not contain that sequence. You will have to build your own corpus and train on it yourself.
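One way to check this before retraining is to list the 3-grams of the failing sequence that never appear in the corpus file. The sketch below is not part of biovec. It assumes the corpus file is plain text with whitespace-separated n-grams and that sequences are split into three shifted reading frames; the helper names (`split_ngrams_by_frame`, `corpus_vocabulary`, `missing_ngrams`) are hypothetical.

```python
from typing import Iterable, List, Set


def split_ngrams_by_frame(seq: str, n: int = 3) -> List[List[str]]:
    """Split seq into non-overlapping n-grams, once per reading frame (offsets 0..n-1)."""
    frames = []
    for offset in range(n):
        frame = [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        frames.append(frame)
    return frames


def corpus_vocabulary(lines: Iterable[str]) -> Set[str]:
    """Collect every whitespace-separated token from the corpus lines (assumed format)."""
    vocab: Set[str] = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def missing_ngrams(seq: str, vocab: Set[str], n: int = 3) -> List[str]:
    """Return the n-grams of seq that are absent from vocab, in first-seen order."""
    missing: List[str] = []
    for frame in split_ngrams_by_frame(seq, n):
        for ngram in frame:
            if ngram not in vocab and ngram not in missing:
                missing.append(ngram)
    return missing


# Toy corpus standing in for the real corpus file.
toy_corpus = ["AGA MQS ASM", "GAM QSA"]
vocab = corpus_vocabulary(toy_corpus)
absent = missing_ngrams("AGAMQSASM", vocab)
```

With a real corpus, `corpus_vocabulary` can be given the open file handle directly, since it only iterates over lines. If an n-gram is present in the corpus file but is still reported as untrained, the Word2Vec `min_count` threshold may be worth checking, because gensim discards tokens that occur fewer times than that value; whether that applies to this model is an assumption.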
The corpus does contain WNA.
Could you please look at the attached code and input file?
output1.txt
window_13re.txt
biovec5.txt
I am getting output, but the error still appears.
Hi, @kyu999
I am facing the exact same error on my end too, but for the n-gram "KQE" instead.
Here's my code snippet -
    pv = ProtVec('INPUT.FASTA', corpus_fname='OUTPUT.TXT', n=3)
    pv["QAT"]
    sequences = list(df[c])  # df[c] contains the AA sequences from which INPUT.FASTA was constructed
    embeddings = []
    for i in sequences:
        embed = pv.to_vecs(i)  # <- Error occurs here
        embeddings.append(embed)
Full code block, if it helps -
    for d in data:
        df = pd.read_csv(d)
        dN = d[:-4]
        for c in cols:
            count = 1
            with open('sequences_{a}_{b}.fasta'.format(a = c, b = dN), 'w') as f:
                for i in range(len(df)):
                    print('>' + str(count) + '\n', df[c][i], file = f)
                    count = count + 1
            pv = ProtVec('sequences_{a}_{b}.fasta'.format(a = c, b = dN), corpus_fname='output_{a}_{b}.txt'.format(a = c, b = dN), n=3)
            pv["QAT"]
            sequences = list(df[c])
            embeddings = []
            for i in sequences:
                embed = pv.to_vecs(i)
                embeddings.append(embed)
            embedding = np.asarray(embeddings)
            all_embeddings = np.reshape(embedding, newshape=(embedding.shape[0], 300))
            dF = pd.DataFrame(all_embeddings, columns = colN, dtype = object)
            dF['modification'] = df['modifications']
            dF.to_csv('dataset-{a}_{b}.model'.format(a = c, b = dN))
            pv.save('sequences_{a}_{b}.model'.format(a = c, b = dN))
(I'm not sure why, but I couldn't get this code block to indent properly when posting, so the indentation may not be exact.)
Please help me get past this error.
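One detail in the full code block that may be worth ruling out: `print('>' + str(count) + '\n', df[c][i], file = f)` passes two arguments, so Python's default separator writes a space in front of every sequence line. Whether that space affects the trained corpus depends on how the FASTA file is parsed, which this thread does not show. A minimal sketch for writing the records without it follows; `clean_sequence` and `write_fasta` are hypothetical helpers, not biovec functions, and removing all whitespace from each sequence is an assumption about what the data should look like.

```python
import io
from typing import Iterable, TextIO


def clean_sequence(raw) -> str:
    """Convert a cell value to str and remove every whitespace character."""
    return "".join(str(raw).split())


def write_fasta(sequences: Iterable[str], handle: TextIO) -> int:
    """Write numbered FASTA records (header line, then sequence line); return the record count."""
    count = 0
    for count, seq in enumerate(sequences, start=1):
        # One write per record, so no separator space can slip in before the sequence.
        handle.write(">{}\n{}\n".format(count, clean_sequence(seq)))
    return count


# In-memory demonstration; a real run would pass an open file instead of StringIO.
buffer = io.StringIO()
written = write_fasta(["MKQE T", " WNAKQE\n"], buffer)
fasta_text = buffer.getvalue()
```

Passing the same cleaned strings to `pv.to_vecs` keeps the sequences used for training and for embedding identical, which makes a mismatch between the corpus and the lookup easier to exclude.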