kyu999 / biovec

A new approach for representing biological sequences
https://pypi.org/project/biovec/

raise Exception("Model has never trained this n-gram: " + ngram) Exception: Model has never trained this n-gram: WNA #14

Open devhimd19 opened 3 years ago

devhimd19 commented 3 years ago

Screenshot from 2021-08-27 15-55-29

kyu999 commented 3 years ago

Thank you for your report! The error means the n-gram "WNA" was never trained: the corpus (the one built from UniProt) does not contain that sequence, so you will have to build your own corpus and train the model on it yourself.
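To check whether a given sequence really contains n-grams that are absent from a corpus, something like the sketch below may help. It is plain Python and does not call biovec. The helper names (`split_ngrams`, `corpus_vocabulary`, `missing_ngrams`) are hypothetical, and it assumes the corpus is stored as whitespace-separated n-grams, one sentence per line, and that sequences are split into non-overlapping n-grams at every reading-frame offset. Both assumptions should be verified against the actual corpus file and library version.

```python
def split_ngrams(seq, n=3):
    """Split seq into n lists of non-overlapping n-grams, one list per frame offset."""
    frames = []
    for offset in range(n):
        # Step through the sequence n characters at a time, starting at this offset;
        # a trailing fragment shorter than n is dropped.
        frame = [seq[i:i + n] for i in range(offset, len(seq) - n + 1, n)]
        frames.append(frame)
    return frames


def corpus_vocabulary(lines):
    """Collect every whitespace-separated token found in the corpus lines (assumed format)."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def missing_ngrams(seq, known_ngrams, n=3):
    """Return the n-grams of seq that are not in known_ngrams, in first-seen order."""
    seen = set()
    missing = []
    for frame in split_ngrams(seq, n):
        for ngram in frame:
            if ngram not in known_ngrams and ngram not in seen:
                seen.add(ngram)
                missing.append(ngram)
    return missing


if __name__ == "__main__":
    # Toy corpus lines for illustration only; a real corpus would be read from a file.
    vocab = corpus_vocabulary(["MKW NAQ", "KWN"])
    print(missing_ngrams("MKWNAQ", vocab))
```

If this reports an n-gram as missing, the corpus genuinely lacks it for that frame offset; if it reports nothing, the cause of the exception probably lies elsewhere.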

devhimd19 commented 3 years ago

The corpus does contain WNA. Could you please look at the attached code and the input file?
output1.txt window_13re.txt biovec5.txt Screenshot from 2021-09-03 11-02-12

I am getting the output, but it still shows the error.

AliASafdari commented 2 years ago

Hi, @kyu999

I am facing the exact same error on my end too, but for the n-gram "KQE" instead.

Here's my code snippet -

    pv = ProtVec('INPUT.FASTA', corpus_fname='OUTPUT.TXT', n=3)
    pv["QAT"]
    sequences = list(df[c])  # df[c] contains the AA sequences from which INPUT.FASTA was constructed
    embeddings = []
    for i in sequences:
        embed = pv.to_vecs(i)  # <- Error occurs here
        embeddings.append(embed)

Full code block, if it helps -

    for d in data:
        df = pd.read_csv(d)
        dN = d[:-4]
        for c in cols:
            count = 1
            with open('sequences_{a}_{b}.fasta'.format(a = c, b = dN), 'w') as f:
                for i in range(len(df)):
                    print('>' + str(count) + '\n', df[c][i], file = f)
                    count = count + 1
            pv = ProtVec('sequences_{a}_{b}.fasta'.format(a = c, b = dN), corpus_fname='output_{a}_{b}.txt'.format(a = c, b = dN), n=3)
            pv["QAT"]
            sequences = list(df[c])
            embeddings = []
            for i in sequences:
                embed = pv.to_vecs(i)
                embeddings.append(embed)
            embedding = np.asarray(embeddings)
            all_embeddings = np.reshape(embedding, newshape=(embedding.shape[0], 300))
            dF = pd.DataFrame(all_embeddings, columns = colN, dtype = object)
            dF['modification'] = df['modifications']
            dF.to_csv('dataset-{a}_{b}.model'.format(a = c, b = dN))
            pv.save('sequences_{a}_{b}.model'.format(a = c, b = dN))

(Idk why, but I can't seem to get this code block to indent properly.)

Please help me get past this error.
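As a diagnostic, the embedding loop from the snippet could be wrapped so that one failing sequence does not abort the whole run and every failure is recorded for inspection. This is only a sketch: `embed_all` is a hypothetical helper, and it takes the embedding function as an argument so it makes no assumption about the library beyond "a callable that raises on failure".

```python
def embed_all(sequences, to_vecs):
    """Embed each sequence, collecting failures instead of stopping at the first one.

    to_vecs is any callable that maps a sequence to its embedding and raises
    an exception when it cannot (for example, the pv.to_vecs method from the
    snippet in this thread).
    """
    embeddings = []
    failures = []
    for index, seq in enumerate(sequences):
        try:
            embeddings.append(to_vecs(seq))
        except Exception as exc:
            # Keep the position, the raw sequence (repr shows hidden whitespace),
            # and the error message so the offending rows can be inspected.
            failures.append((index, repr(seq), str(exc)))
    return embeddings, failures


# Intended use with the objects from the thread (not run here):
# embeddings, failures = embed_all(list(df[c]), pv.to_vecs)
# for index, seq, message in failures:
#     print(index, seq, message)
```

Looking at the recorded sequences with `repr` makes stray spaces, newlines, or lowercase letters visible, which is one possible reason an n-gram from the data frame might not match what ended up in the corpus.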