labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

Too large embedding files when using --asdir #28

staszekdh closed this issue 9 months ago

staszekdh commented 11 months ago

A database for 10k sequences calculated with --asdir

python embeddings.py start data_set_fullseq.fasta data_set_fullseq -embedder pt --gpu -bs 0 --asdir

consists of very large files (~20 MB each), and the database is 189 GB in total. If you build the database without `--asdir`, the resulting concatenated embedding file is only 3 GB.

staszekdh commented 11 months ago

@Argusmocny Example sequences to reproduce the problem:

>1
MANRDCNADWKISKARRSYKVGYASTRHEDRSTGMTRYYSQYPSLHLKGNWLEEAGFTTGQAVNITVERGQLIIRLVENS
>2
MGAQLYPIREERGSVEVIPYVRLRGRWLDKLGFDVGSRLKIDAEHGRITLTVIERPVPAPVKIPRKLQRLAREAARASASTDGGKA
>3
MTRPEFVPPKRKPYARPAPTCKVGAQHYPAREEYGSEEVIPYVRLRGRWLDKLGFDVGARLKIETRPGCITLTVVERPVVVPKKIPRKLQRTAG
>4
MLTPWEDEPDDARPKRKPYARPARSYRVGALTYPDREECGPTEIVPYLKLRGRWLDKLGFDVGARLKVEATHGSITLTVVERPVPVVKKIPRKLQRRTG
>5
MTDMHSIAQPFEAEVSPANNRQLTVSYASRYPDYSRIPAITLKGQWLEAAGFTTGTAVDVKVMEGCIVLTAQPLAVEESELMQSLRQVCKLSARKQKQVQAFIGVIAGKQKVA

If --asdir is used, the resulting 5 files are 3.4M. If --asdir is not used, the resulting single file is only 1.1M.
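
For anyone who wants to check the size difference on their own output, here is a minimal sketch of how the two results could be compared. It uses only the Python standard library and is not part of pLM-BLAST. The helper names `total_size_bytes` and `human_readable` are made up for this example, and the paths are whatever you pass on the command line (I am not assuming any particular output file name or extension).

```python
import os
import sys


def total_size_bytes(path):
    """Return the size of a file, or the summed size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def human_readable(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024


if __name__ == "__main__":
    # Hypothetical usage: pass the --asdir output directory and the
    # single-file output as arguments; each is reported separately.
    for path in sys.argv[1:]:
        print(f"{path}: {human_readable(total_size_bytes(path))}")
```

Running it with the `--asdir` directory and the single-file output as the two arguments prints one total per path, so the two numbers can be compared directly.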

Argusmocny commented 10 months ago

There is a bug in the embedders.py code that leads to saving unnecessary data in single-file mode (without the --asdir flag). This will be fixed in an upcoming update.