torognes / vsearch

Versatile open-source tool for microbiome analysis

Further reduce memory requirements for dereplication #334

Closed. frederic-mahe closed this issue 5 years ago.

frederic-mahe commented 5 years ago

This is a follow up on issue #329.

vsearch's memory footprint when dereplicating is significantly reduced with version 2.8.3. However, one would expect the memory footprint to stay the same when the input is 100% redundant (same sequence repeated n times). When testing with 10 and 100 million reads, vsearch's memory footprint keeps growing beyond the initial 100 MB of memory reservation.

As a side note, usearch v11 (32-bit version) does not seem to use a memory-preservation strategy. One point for vsearch :-)

## bash v4.2.46

USEARCH="usearch11.0.667_i86linux32"
VSEARCH="vsearch2.8.3"
NUMBER_OF_SEQUENCES=100000000
SEQUENCE_LENGTH=396
TMP=$(mktemp)

## generate a random sequence and repeat it n times
SEQUENCE=$(tr -dc "ACGT" < /dev/urandom | head -c ${SEQUENCE_LENGTH})
yes $(printf ">s\t%s\n" "${SEQUENCE}") | \
    head -n ${NUMBER_OF_SEQUENCES} | \
    tr " " "\n" > ${TMP}

## file size (in megabytes)
INPUT_SIZE=$(stat --printf="%s" ${TMP} | awk '{print $1 / (1024 * 1024)}')
echo "${INPUT_SIZE}"

## usearch 11
/usr/bin/time \
    "${USEARCH}" \
    -fastx_uniques ${TMP} \
    -fastaout /dev/null \
    -sizeout

#     1000 reads    0.4 MB -> usearch     2,088 kB
#   100000 reads   38.1 MB -> usearch    75,560 kB
#  1000000 reads  381.4 MB -> usearch   472,008 kB
# 10000000 reads 3814.7 MB -> usearch 4,088,764 kB (ERROR)

## vsearch 2.8.3
/usr/bin/time \
    "${VSEARCH}" \
    --derep_fulllength ${TMP} \
    --sizeout \
    --quiet \
    --log - \
    --output /dev/null

#       100 reads vsearch 113,152 kB
#      1000 reads vsearch 113,152 kB    (0.4 MB of input)
#   1000000 reads vsearch 113,148 kB    (381 MB of input)
#  10000000 reads vsearch 189,824 kB  (3,814 MB of input)
# 100000000 reads vsearch 763,264 kB (38,147 MB of input)

rm ${TMP}

@torognes I'd be interested to know if my simple bash commands to generate repeated sequences work on macOS. Producing and dereplicating 100 million reads takes only a few minutes on my workstation.

torognes commented 5 years ago

Thanks for showing this. I have now committed some changes that will further reduce memory usage. First, I allocate memory only for 1024 distinct sequences initially, instead of 1M. This will grow as necessary. Additionally, I realised that some of the memory was only really needed when the uc option was in effect (for storing information about the matches). The latter is now only allocated when needed (with the uc option). This should reduce the memory requirements significantly, and dramatically in some cases.

All the tests above now use only 1.5 MB of memory.
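
To check the effect of the uc-related allocation specifically, one could compare peak memory with and without that option. A minimal sketch, assuming GNU time (/usr/bin/time -v) and big.fasta as a placeholder for a large, highly redundant input:

## peak memory without and with the --uc output (big.fasta is a placeholder)
/usr/bin/time -v vsearch --derep_fulllength big.fasta --sizeout \
    --output /dev/null 2>&1 | grep "Maximum resident"
/usr/bin/time -v vsearch --derep_fulllength big.fasta --sizeout \
    --output /dev/null --uc /dev/null 2>&1 | grep "Maximum resident"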

I like that creative bash script, but unfortunately it won't run on macOS due to a few quirks. I found out that the tr command will not accept binary input unless it is preceded by export LC_ALL=C, as my locale was set to Unicode and some byte sequences were considered illegal. There were also some problems with the tab in the printf statement, and macOS stat lacks the --printf option. Instead I made a single random sequence with a specified abundance and then used the vsearch rereplication command to generate lots of identical sequences.
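
For reference, the rereplication trick looks roughly like this (a sketch: single.fasta and repeated.fasta are placeholder names, and the abundance is carried by the ;size= annotation):

## one random sequence with an abundance annotation, expanded into
## many identical reads with the rereplication command
SEQUENCE_LENGTH=396
NUMBER_OF_SEQUENCES=100000000
SEQUENCE=$(LC_ALL=C tr -dc "ACGT" < /dev/urandom | head -c ${SEQUENCE_LENGTH})
printf ">s;size=%d\n%s\n" ${NUMBER_OF_SEQUENCES} "${SEQUENCE}" > single.fasta
vsearch --rereplicate single.fasta --output repeated.fasta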

torognes commented 5 years ago

In some cases the new approach (not reading and storing the entire file first) may use slightly more memory. That is because it needs to dynamically resize a hash table: when the table is expanded, the old and new tables must be stored simultaneously. The old approach did not have to resize it because it already knew how many sequences there were in total.
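
Roughly speaking, and assuming for the sake of illustration that the table grows by doubling (an assumption, not a description of the actual code), the old table of C slots and the new table of 2C slots coexist briefly, so the transient peak is about 3C slots:

## illustration only, not vsearch code: transient peak while a table
## doubles from 1024 slots until it can hold N distinct sequences
N=5000000
CAPACITY=1024
PEAK=${CAPACITY}
while [ ${CAPACITY} -lt ${N} ] ; do
    PEAK=$((3 * CAPACITY))    ## old table (C) plus new table (2C)
    CAPACITY=$((2 * CAPACITY))
done
echo "final capacity: ${CAPACITY} slots, transient peak: ${PEAK} slots"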

frederic-mahe commented 5 years ago

Wow, that was fast!

I can confirm that the upcoming version 2.8.4 now uses only 1.6 MB for a 100% redundant 38 GB dataset (100 million reads), whereas version 2.8.3 used 745 MB. The new version is slightly slower (144-153 s vs 137-142 s). It's great that the extra memory was only needed for the uc output: since I don't use this output format for my giant fasta file dereplications, the memory footprint will be further reduced.

Instead I made a single random sequence with a specified abundance and then used the vsearch rereplication command to generate lots of identical sequences.

I should have thought about this one. Smart hacking.

frederic-mahe commented 5 years ago

In some cases the new approach (not reading and storing the entire file first) may use slightly more memory. That is because it needs to dynamically resize a hash table: when the table is expanded, the old and new tables must be stored simultaneously. The old approach did not have to resize it because it already knew how many sequences there were in total.

I am working on a plot showing that almost all real-life metabarcoding samples I have access to exhibit a certain level of redundancy, typically between 60 and 90% (meaning that the dereplicated file should be roughly 10 to 40% of the size of the input file). The new dereplication approach should save memory in most cases.
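
As a quick sanity check, the redundancy of a given fasta file can be estimated with standard tools. A rough sketch, assuming one sequence per line (sample.fasta is a placeholder):

## percentage of reads that are exact duplicates of an earlier read
## (assumes unwrapped fasta; sample.fasta is a placeholder)
TOTAL=$(grep -c "^>" sample.fasta)
UNIQUE=$(grep -v "^>" sample.fasta | sort | uniq | wc -l)
awk -v t=${TOTAL} -v u=${UNIQUE} 'BEGIN {printf "%.1f%% redundancy\n", 100 * (1 - u / t)}'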

frederic-mahe commented 5 years ago

Does this work on macOS?

NUMBER_OF_SEQUENCES=100
SEQUENCE_LENGTH=396
TMP=$(mktemp)

## generate a random sequence and repeat it n times
SEQUENCE=$(LC_ALL=C tr -dc "ACGT" < /dev/urandom | head -c ${SEQUENCE_LENGTH})
yes $(printf ">s %s\n" "${SEQUENCE}") | \
    head -n ${NUMBER_OF_SEQUENCES} | \
    tr " " "\n" > ${TMP}

## file size (in megabytes)
stat -c "%s" ${TMP} | awk '{print $1 / (1024 * 1024)}'

rm ${TMP}

torognes commented 5 years ago

No, not directly. The space in the printf argument makes yes print only the header (>s) and no sequence, since the unquoted command substitution is split into two words and only the first one is repeated. It can be replaced by an underscore (_). Also the stat command has a rather different syntax (-f instead of -c and %z instead of %s). It is perhaps more portable to use ls. The following works well:

NUMBER_OF_SEQUENCES=100
SEQUENCE_LENGTH=396
TMP=$(mktemp)

## generate a random sequence and repeat it n times
SEQUENCE=$(LC_ALL=C tr -dc "ACGT" < /dev/urandom | head -c ${SEQUENCE_LENGTH})
yes $(printf ">s_%s\n" "${SEQUENCE}") | \
    head -n ${NUMBER_OF_SEQUENCES} | \
    tr "_" "\n" > ${TMP}

## file size (in megabytes)
ls -l ${TMP} | awk '{print $5 / (1024 * 1024)}'

rm ${TMP}
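
For reference, the two stat dialects mentioned above, side by side (some_file is a placeholder; the GNU form is the one used in the scripts above, the BSD form uses the flags described here):

## GNU coreutils stat (Linux): size in bytes
stat -c "%s" some_file
## BSD stat (macOS): same information, different flags
stat -f "%z" some_file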

torognes commented 5 years ago

I have released version 2.8.4 now with these changes.

frederic-mahe commented 5 years ago

There is an old iMac in my office that is not used anymore. I will set it up as a test "BSD" machine for my scripts. Regarding your use of ls to get the size of the file, it is stated here that:

ls has a similar problem in that the exact format of the output is not specified, so parsing its output cannot be done portably.

Based on this forum entry and other sources, the most portable way to get a file's size seems to be to use wc -c. It works well for our ASCII files:

NUMBER_OF_SEQUENCES=100
SEQUENCE_LENGTH=396
TMP=$(mktemp)

## generate a random sequence and repeat it n times
SEQUENCE=$(LC_ALL=C tr -dc "ACGT" < /dev/urandom | head -c ${SEQUENCE_LENGTH})
yes $(printf ">s_%s\n" "${SEQUENCE}") | \
    head -n ${NUMBER_OF_SEQUENCES} | \
    tr "_" "\n" > ${TMP}

## file size (in megabytes)
LC_ALL=C wc -c < ${TMP} | awk '{print $1 / (1024 * 1024)}'

rm ${TMP}

There are unit tests for vsearch that could use that fast and simple code above to produce input data. I am glad we've found something that works for both GNU/Linux and BSD.
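
If it helps, the generator could be wrapped in a small function for the test scripts (a sketch; the function name and the example file name are made up):

## hypothetical helper for test scripts: write many copies of one random
## sequence to a file (works on both GNU/Linux and BSD systems)
generate_redundant_fasta() {
    local number_of_sequences="${1}"
    local sequence_length="${2}"
    local output="${3}"
    local sequence
    sequence=$(LC_ALL=C tr -dc "ACGT" < /dev/urandom | head -c "${sequence_length}")
    yes $(printf ">s_%s\n" "${sequence}") | \
        head -n "${number_of_sequences}" | \
        tr "_" "\n" > "${output}"
}

## usage example
generate_redundant_fasta 100 396 test_input.fasta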