halelab / GBS-SNP-CROP

GBS SNP Calling Reference Optional Pipeline
GNU General Public License v2.0

Expand step 4 #15

Closed by ne1s0n 6 years ago

ne1s0n commented 6 years ago

This branch makes several changes to step 4. I've tried to be consistent both in naming conventions (variables and command line parameters) and in numbering (for the internal processing steps).

The script compiles and runs without error, both for single and for paired ends. I've added explanations in the command line help. I can update the online documentation, if required.

**Singleton removal** Optional, activated with the new command line parameter `-rs`. It's done with the vsearch `--fastx_filter` command, setting the minimum accepted depth for a read to 2. It must be done after all fasta files are concatenated.
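For illustration, the depth-2 filter amounts to dropping dereplicated reads whose size annotation is 1. A minimal Python sketch (not the pipeline's actual Perl code; the function name and the header parsing are assumptions based on vsearch's `;size=N` convention):

```python
def remove_singletons(records, min_size=2):
    """Drop dereplicated reads observed fewer than min_size times.

    records: list of (header, sequence) tuples, where the header carries
    a vsearch-style ';size=N' annotation. Toy model for illustration.
    """
    kept = []
    for header, seq in records:
        size = 1
        for field in header.split(";"):
            if field.startswith("size="):
                size = int(field[len("size="):])
        if size >= min_size:
            kept.append((header, seq))
    return kept
```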

**Dereplication** This step removes fully duplicated reads (very common in GBS). It is done using the vsearch `--derep_fulllength` command. It is important to tell vsearch to keep track of the number of times each read is encountered (called "size" in vsearch lingo). This way no information is actually lost, but files are smaller and the analysis faster. To do so I've added two options (`--sizein` and `--sizeout`) to each vsearch invocation.
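Conceptually, full-length dereplication with size tracking is just counting identical sequences. A toy Python model of what `--derep_fulllength --sizeout` produces (the output naming scheme here is an assumption):

```python
from collections import Counter

def derep_full_length(reads):
    """Collapse identical sequences into unique reads, each annotated
    with a ';size=N' count, mimicking vsearch --derep_fulllength --sizeout.
    Illustrative only; the real pipeline shells out to vsearch.
    """
    counts = Counter(reads)
    return [
        (f"read{i};size={n}", seq)
        for i, (seq, n) in enumerate(counts.most_common(), start=1)
    ]
```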

Actual dereplication is done twice: once on each .fasta separately, and once on the collated grand file. This is done to shrink the dataset as early as possible.

I've added the option to concatenate the fastas in groups of N files. So, instead of a single `cat *.fasta > bigfile.fasta`, there's a little bit of logic: files are divided in groups, concatenated, dereplicated, and then put together with the rest of the dataset. This is done to save RAM: we now avoid ever having to analyze the full list of reads (which is full of duplicates). This block-wise dereplication is optional and activated through a new command line parameter (`-db`, whose value goes to `$derep_blocks`). If not specified, its default value is the number of samples, which restores the old behaviour (similar to the big `cat *.fasta`).
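The block-wise scheme can be sketched as follows: dereplicate each group of files, then dereplicate the union of the group results, summing sizes so no counts are lost (this is what `--sizein`/`--sizeout` make possible). A hypothetical Python model, not the actual Perl implementation:

```python
def sum_derep(pairs):
    """Toy stand-in for vsearch --derep_fulllength --sizein --sizeout:
    merge (sequence, size) pairs, summing sizes of identical sequences."""
    totals = {}
    for seq, size in pairs:
        totals[seq] = totals.get(seq, 0) + size
    return sorted(totals.items())

def derep_in_blocks(sample_files, block_size):
    """Dereplicate per-sample read lists in groups of block_size files,
    then dereplicate the collated partial results, so the full
    duplicated read list is never held in memory at once."""
    partial = []
    for i in range(0, len(sample_files), block_size):
        block = sample_files[i:i + block_size]
        merged = [pair for f in block for pair in f]
        partial.extend(sum_derep(merged))
    return sum_derep(partial)
```

Because sizes are summed at every stage, the two-level dereplication gives exactly the same counts as one big pass over all files.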

**Multibarcode support** I had the same sample present multiple times in the barcode file, with different barcodes (this typically happens when joining sequences from different libraries). Currently GBS-SNP-CROP does not fully support this case and, in previous steps, it creates some conflicts. In this step, having the same sample on multiple lines only generated extra (useless) operations, since the same file ended up being compressed/decompressed multiple times. I've removed duplicates from `@MR_taxa_files` to fix the issue and to get the true number of samples in the dataset.
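The fix boils down to an order-preserving removal of duplicate sample names from the taxa list. A Python sketch of the idea (the Perl code operates on `@MR_taxa_files`; this function name is illustrative):

```python
def unique_taxa(taxa):
    """Remove duplicate sample names while preserving their first-seen
    order, so each sample's file is processed exactly once and
    len(result) is the true number of samples."""
    seen = set()
    out = []
    for taxon in taxa:
        if taxon not in seen:
            seen.add(taxon)
            out.append(taxon)
    return out
```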

**Minimum length also for single end** I've implemented length filtering on single end fastq files, so that reads under a (user specified, optional) minimum length are discarded. To avoid introducing yet another command line parameter, I've piggybacked on `$pear_length`, which serves this exact purpose for paired ends.
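The single end filter itself is a one-liner: keep only FASTQ records whose sequence meets the threshold. A toy Python model (record layout and function name are assumptions, not the pipeline's code):

```python
def filter_fastq_by_length(records, min_length):
    """Keep single-end FASTQ records whose sequence is at least
    min_length bases long; records are (header, seq, plus, qual)
    tuples. Models the filter driven by $pear_length."""
    return [r for r in records if len(r[1]) >= min_length]
```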

ne1s0n commented 6 years ago

I've expanded on my previous pull request, with the main changes on the paired ends side of step 4. I have now minimized the hard disk space required for temporary files. The original approach was:

This approach created a lot of temporary files (2-3 times the uncompressed fasta sequences), which were often not deleted immediately after being used. The new approach I implemented is:

This way it's possible to remove temporary files right away. I've also added a compression step, so the files stored in the /fastaForRef folder are now gzipped.
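The compress-then-delete step can be illustrated with a short Python sketch (a minimal stand-in; the actual pipeline likely calls gzip from Perl):

```python
import gzip
import os
import shutil

def gzip_and_remove(path):
    """Compress a file to path.gz and delete the uncompressed original,
    as now done for the fasta files kept in the fastaForRef folder.
    Illustrative helper, not the pipeline's actual code."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path
```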

arthurmelobio commented 6 years ago

Hi @ne1s0n,

Thank you very much for all these useful suggestions to GBS-SNP-CROP. Except for the multibarcode/duplicated genotypes issue, which should be addressed in step 1 rather than step 4, everything makes a lot of sense, especially when we look at the analysis time improvement.

We'll probably release these changes as v3.1.

Best, Arthur