metageni / SUPER-FOCUS

A tool for agile functional analysis of shotgun metagenomic data
GNU General Public License v3.0

subsample data #80

Closed linsalrob closed 2 years ago

linsalrob commented 2 years ago

This PR allows you to request a subsample of the data.

If you have very large sequence files, you may not want, or be able, to process all the reads, so you can "subsample" the data instead.

Note that the subsample is the first n reads, where n is the value given to the --subsample option, rather than a random sample of the reads. This keeps paired files in sync: if you have R1 and R2 files, the same leading reads are taken from each, so you should end up with the correct paired reads in each output file. It is also easier to implement.

If you request a --subsample larger than your sequence file, you will get all the sequences.
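To illustrate the behaviour described above, here is a minimal sketch of a "first n reads" subsample. It is not the code in this PR: the function name `subsample_fastq` is hypothetical, and it assumes uncompressed FASTQ input with exactly four lines per record.

```python
from itertools import islice


def subsample_fastq(src, dest, n):
    """Copy the first n FASTQ records from src to dest.

    Assumes uncompressed FASTQ with four lines per record.
    Returns the number of records actually written.
    """
    written = 0
    with open(src) as fin, open(dest, "w") as fout:
        while written < n:
            # Read one record (header, sequence, '+', qualities)
            record = list(islice(fin, 4))
            if len(record) < 4:
                # End of file: n was larger than the file, so we keep everything
                break
            fout.writelines(record)
            written += 1
    return written
```

Calling this once for R1 and once for R2 with the same n takes the same leading records from each file, so the pairs stay matched as long as the two input files are in the same order. If n exceeds the number of records, the loop simply stops at the end of the file and all sequences are written.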

The subsampled file is temporary: it is written to the temporary directory, which is cleaned up before exiting, so it is not saved.
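The clean-up behaviour can be sketched with Python's standard `tempfile` module. This is an assumption about the general pattern, not the tool's actual code; `run_with_subsample` and the file name `subsample.fastq` are hypothetical, and the record count stands in for the real analysis step.

```python
import tempfile
from pathlib import Path


def run_with_subsample(lines, n):
    """Write the first n four-line records to a temp dir, use them, then clean up.

    `lines` is a list of FASTQ lines, each ending in a newline.
    Returns (records processed, whether the temp file still exists afterwards).
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        subsampled = Path(tmpdir) / "subsample.fastq"
        subsampled.write_text("".join(lines[: n * 4]))
        # Stand-in for the real analysis: count the records in the temp file
        n_records = len(subsampled.read_text().splitlines()) // 4
    # The directory and its contents are removed when the with-block exits
    return n_records, subsampled.exists()
```

Because the subsample lives only inside the temporary directory, you would need to copy it elsewhere before the run finishes if you wanted to keep it.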