labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

Duplicate indexes in a CSV file generated by the embeddings.py script #51

Closed (staszekdh closed this issue 2 months ago)

staszekdh commented 2 months ago

Given a FASTA file such as:

>10mh_A [SAM]
DKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQ
>1a27_A [NADP]
ARTVVLITGCSSGIGLHLAVRLASDPSQSFKVYATLRDLKTQ
>1a4i_A [NADP]
GVPIAGRHAVVVGRSKIVGAPMHDLLLWNNATVTTCHSKTAH
>1a5z_A [NAD]
MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKR
>1a71_A [NAD]
AKVTQGSTCAVFGLGGAGLSVIMGCKAAGAARIIGVDINKDK

A database was created with the following command:

python ~/calc/pLM-BLAST/embeddings.py start sequences.fasta sequences -embedder pt --gpu -bs 0 --asdir

This produced a directory, sequences, containing the embedding files, and an index file, sequences.csv. The index file looks like this:

,level_0,index,queryid,id,sequence,description,seqlens
0,0,0,0,10mh_A [SAM],DKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQ,na,41
1,1,1,1,1a27_A [NADP],ARTVVLITGCSSGIGLHLAVRLASDPSQSFKVYATLRDLKTQ,na,42
2,2,2,2,1a4i_A [NADP],GVPIAGRHAVVVGRSKIVGAPMHDLLLWNNATVTTCHSKTAH,na,42
3,3,3,3,1a5z_A [NAD],MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKR,na,37
4,4,4,4,1a71_A [NAD],AKVTQGSTCAVFGLGGAGLSVIMGCKAAGAARIIGVDINKDK,na,42
5,5,5,5,1a7a_A [NAD],DVMIAGKVAVVAGYGDVGKGCAQALRGFGARVIITEIDPIN,na,41

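The description column contains only "na", even though the FASTA headers carry descriptions (for example [SAM]); these end up in the id column instead. The file also contains four redundant index columns: the unnamed first column, level_0, index, and queryid all hold the same row number.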
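
Until this is addressed in embeddings.py itself, the index file can be tidied after the fact. The following is only a minimal sketch of such a post-processing step, not part of pLM-BLAST: the helper names (redundant_index_columns, split_header, clean_index) are hypothetical, splitting the header at the first whitespace is an assumption about how id and description should be divided, and I have not checked whether other pLM-BLAST scripts rely on the columns being dropped.

```python
import csv
import io

# First rows of the sequences.csv shown above, copied verbatim.
CSV_TEXT = """\
,level_0,index,queryid,id,sequence,description,seqlens
0,0,0,0,10mh_A [SAM],DKQLTGLRFIDLFAGLGGFRLALESCGAECVYSNEWDKYAQ,na,41
1,1,1,1,1a27_A [NADP],ARTVVLITGCSSGIGLHLAVRLASDPSQSFKVYATLRDLKTQ,na,42
"""


def redundant_index_columns(header, rows):
    """Return the names of columns that merely repeat the 0-based row number."""
    return [
        name
        for col, name in enumerate(header)
        if all(row[col] == str(pos) for pos, row in enumerate(rows))
    ]


def split_header(fasta_header):
    """Split a FASTA header at the first space into (id, description).

    Assumption: everything after the first space is the description.
    Headers without a description keep the 'na' placeholder.
    """
    name, _, description = fasta_header.partition(" ")
    return name, description or "na"


def clean_index(csv_text):
    """Drop row-number columns and move the header remainder into 'description'."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    drop = set(redundant_index_columns(header, rows))
    cleaned = []
    for row in rows:
        record = {name: value for name, value in zip(header, row) if name not in drop}
        record["id"], record["description"] = split_header(record["id"])
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in clean_index(CSV_TEXT):
        print(record["id"], record["description"], record["seqlens"])
```

The redundancy check is deliberately data-driven (a column is dropped only if it equals the row position in every row) rather than hard-coding the column names, so it leaves the file alone if a future version of the script already writes a clean index.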