Closed staszekdh closed 9 months ago
@Argusmocny Example sequences to reproduce the problem:
>1
MANRDCNADWKISKARRSYKVGYASTRHEDRSTGMTRYYSQYPSLHLKGNWLEEAGFTTGQAVNITVERGQLIIRLVENS
>2
MGAQLYPIREERGSVEVIPYVRLRGRWLDKLGFDVGSRLKIDAEHGRITLTVIERPVPAPVKIPRKLQRLAREAARASASTDGGKA
>3
MTRPEFVPPKRKPYARPAPTCKVGAQHYPAREEYGSEEVIPYVRLRGRWLDKLGFDVGARLKIETRPGCITLTVVERPVVVPKKIPRKLQRTAG
>4
MLTPWEDEPDDARPKRKPYARPARSYRVGALTYPDREECGPTEIVPYLKLRGRWLDKLGFDVGARLKVEATHGSITLTVVERPVPVVKKIPRKLQRRTG
>5
MTDMHSIAQPFEAEVSPANNRQLTVSYASRYPDYSRIPAITLKGQWLEAAGFTTGTAVDVKVMEGCIVLTAQPLAVEESELMQSLRQVCKLSARKQKQVQAFIGVIAGKQKVA
If `--asdir` is used, the resulting 5 files are 3.4M. If `--asdir` is not used, the resulting single file is only 1.1M.
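To check the same numbers on another machine, the sizes of the two outputs can be measured with a small script. This is only a minimal sketch: the helper names `total_size_bytes` and `human` are made up for illustration and are not part of the tool, and the paths to compare are whatever you pass on the command line.

```python
import os
import sys


def total_size_bytes(path):
    """Return the size of a file, or the summed size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def human(nbytes):
    """Format a byte count with binary units (B, K, M, G, T)."""
    for unit in ("B", "K", "M", "G", "T"):
        if nbytes < 1024 or unit == "T":
            return f"{nbytes:.1f}{unit}"
        nbytes /= 1024


if __name__ == "__main__":
    # Pass the --asdir output directory and the single-file output as arguments.
    for path in sys.argv[1:]:
        print(path, human(total_size_bytes(path)))
```

Running it with both the directory output and the single-file output as arguments prints one size per path, which makes the difference easy to compare.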
There is a bug in the `embedders.py` code that causes unnecessary data to be saved in single-file mode (without the `--asdir` flag). This will be fixed in an upcoming update.
A database for 10k sequences calculated with `--asdir` consists of very large files (~20MB each), and the database is 189GB in total. If you build a database without `--asdir`, the resulting concatenated embedding file is only 3GB.