rki-mf1 / covsonar

A database-driven system for handling genomic sequences of SARS-CoV-2 and screening genomic profiles.
GNU General Public License v3.0

Import cache takes up a lot of disk space #111

Closed matthuska closed 11 months ago

matthuska commented 1 year ago

When importing data, covsonar "caches" all relevant information in files on the filesystem before running the alignment and finally adding the data to the database. This cache can be significantly larger than the final database containing all of the imported data. It would be nice to reduce both the size of the cache and the number of files placed in it.

Here's an example from importing 90k sequences and associated metadata:

cache/ $ du -hs *
0       import.log
1.3M    ref
399M    samples
3.4G    seq
1.8G    var

And the database itself is 2.1G.

Requested by: Jule@HPI

matthuska commented 1 year ago

The sequences are taking up the majority of space (see the first comment), and I think that one of the main reasons they are kept even after alignment is that they are used to detect duplicate sequences.

I'd suggest calculating seqhashes for all sequences that are supposed to be imported, and then checking them against each other as well as against the database to identify duplicates before any other processing takes place (in a single thread, before any parallel processing starts). Then we would only need to write each sequence to disk just before it is mapped, and could delete it again as soon as we have the variants. In this case only a small number of sequences (= the number of processes being used) would ever be sitting on the disk at the same time.
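
A rough sketch of how this could look, not taken from the covsonar code base. It assumes the seqhash is a SHA-256 digest of the upper-cased sequence, which may differ from what covsonar actually computes. The names `seqhash`, `filter_new_sequences`, `map_with_temp_file` and the `call_variants` callback are hypothetical and only there for illustration. How skipped duplicates get linked to their sample metadata is left out.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def seqhash(seq: str) -> str:
    # Assumption: hash of the upper-cased sequence; covsonar's real
    # seqhash may be defined differently.
    return hashlib.sha256(seq.strip().upper().encode("utf-8")).hexdigest()


def filter_new_sequences(
    records: Iterable[Tuple[str, str]],
    known_hashes: Iterable[str],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, seqhash) for sequences not seen before.

    Meant to run in a single thread before any parallel work starts.
    `known_hashes` stands in for the seqhashes already in the database.
    """
    seen = set(known_hashes)
    for name, seq in records:
        h = seqhash(seq)
        if h in seen:
            # Duplicate of a database entry or of an earlier input record.
            continue
        seen.add(h)
        yield name, seq, h


def map_with_temp_file(
    name: str,
    seq: str,
    call_variants: Callable[[Path], T],
) -> T:
    """Write one sequence to disk only for the duration of the mapping.

    `call_variants` is a hypothetical callback that aligns the FASTA file
    and returns the variants. The temporary directory, including the
    FASTA file, is removed as soon as the callback returns.
    """
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "query.fasta"
        fasta.write_text(f">{name}\n{seq}\n")
        return call_variants(fasta)
```

If each worker process calls something like `map_with_temp_file` for one sequence at a time, the number of sequence files on disk stays bounded by the number of workers, regardless of how many sequences are imported.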

matthuska commented 11 months ago

Closing because development on covsonar 2 has stopped.

See https://github.com/rki-mf1/sonar-cli/wiki/Importing-sequences for work on this topic for the new version of sonar.