Too large embedding files when using --asdir

staszekdh commented 11 months ago

A database for 10k sequences calculated with --asdir

python embeddings.py start data_set_fullseq.fasta data_set_fullseq -embedder pt --gpu -bs 0 --asdir

consists of very large files (~20MB each), and the database is 189GB in total. If you build the database without `--asdir`, the resulting concatenated embedding file is only 3GB.

staszekdh commented 11 months ago

@Argusmocny Example sequences to reproduce the problem:

>1
MANRDCNADWKISKARRSYKVGYASTRHEDRSTGMTRYYSQYPSLHLKGNWLEEAGFTTGQAVNITVERGQLIIRLVENS
>2
MGAQLYPIREERGSVEVIPYVRLRGRWLDKLGFDVGSRLKIDAEHGRITLTVIERPVPAPVKIPRKLQRLAREAARASASTDGGKA
>3
MTRPEFVPPKRKPYARPAPTCKVGAQHYPAREEYGSEEVIPYVRLRGRWLDKLGFDVGARLKIETRPGCITLTVVERPVVVPKKIPRKLQRTAG
>4
MLTPWEDEPDDARPKRKPYARPARSYRVGALTYPDREECGPTEIVPYLKLRGRWLDKLGFDVGARLKVEATHGSITLTVVERPVPVVKKIPRKLQRRTG
>5
MTDMHSIAQPFEAEVSPANNRQLTVSYASRYPDYSRIPAITLKGQWLEAAGFTTGTAVDVKVMEGCIVLTAQPLAVEESELMQSLRQVCKLSARKQKQVQAFIGVIAGKQKVA

If --asdir is used, the resulting 5 files are 3.4M. If --asdir is not used, the resulting single file is only 1.1M.
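To make the comparison above easy to repeat, here is a minimal sketch that sums the on-disk size of an output path, whether it is a directory of per-sequence files or a single concatenated file. The helper names (`size_on_disk`, `report`) and the two paths at the bottom are hypothetical placeholders, not part of pLM-BLAST; substitute whatever `embeddings.py` actually wrote on your machine.

```python
from pathlib import Path


def size_on_disk(path):
    """Return the bytes used by a file, or by all files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    # Directory case: add up every regular file found recursively.
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def report(path):
    """Print the total size and, for a directory, the size of each file in it."""
    p = Path(path)
    if not p.exists():
        print(f"{path}: not found")
        return
    print(f"{path}: {size_on_disk(p) / 1e6:.2f} MB total")
    if p.is_dir():
        for f in sorted(p.iterdir()):
            if f.is_file():
                print(f"  {f.name}: {f.stat().st_size / 1e6:.2f} MB")


if __name__ == "__main__":
    # Hypothetical output locations: one run with --asdir, one without.
    report("data_set_fullseq_asdir")
    report("data_set_fullseq_single")
```

Running it on both outputs shows whether the extra space is spread evenly across the per-sequence files or concentrated in a few of them.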

Argusmocny commented 10 months ago

There is a bug in the embedders.py code that leads to unnecessary data being saved in single-file mode (without the --asdir flag). This will be fixed in an upcoming update.

labstructbioinf / pLM-BLAST

Too large embedding files when using --asdir #28