nhoffman / dada2-nf

A Nextflow pipeline for processing 16S rRNA sequences using dada2

add script and process to extract unmerged seqs from the rds file #49

Closed dhoogest closed 2 years ago

dhoogest commented 2 years ago

This MR adds a new Rscript that mimics the logic in the old NGS16S pipeline and outputs a CSV with the following columns:

weight,sequence

The listed seqs are the SVs dropped during the 'chimera check' step. The params-ngs16s.json test data set runs through to completion, but none of its SVs appear to be dropped as chimeras, so it cannot serve as verification.
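For illustration only, the shape of that output can be sketched in Python (the actual script in this MR is R, and its code is not shown here). The sketch assumes the dropped SVs are simply those present before chimera removal and absent afterwards; the function names and toy sequences are hypothetical.

```python
import csv
import io


def dropped_chimeras(before, after):
    """Return (weight, sequence) pairs for SVs in `before` that are missing from `after`.

    `before` and `after` map sequence -> read count (hypothetical inputs,
    standing in for the sequence tables before and after the chimera check).
    """
    dropped = [(count, seq) for seq, count in before.items() if seq not in after]
    # Heaviest first; ties are broken alphabetically so the order is stable.
    dropped.sort(key=lambda pair: (-pair[0], pair[1]))
    return dropped


def to_csv(rows):
    """Render rows as CSV text with a weight,sequence header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["weight", "sequence"])
    writer.writerows(rows)
    return buf.getvalue()


# Toy data, not taken from the pipeline.
before = {"ACGT": 50, "ACGA": 7, "TTGA": 3}
after = {"ACGT": 50}
rows = dropped_chimeras(before, after)
print(to_csv(rows), end="")
```

On the toy input this prints a header line followed by the two dropped sequences, heaviest first.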

TODO: extend the test set to include a sample known to contain chimeras

dhoogest commented 2 years ago

Okay, I've added another test sample, which was used in an original issue for validating the dropped-chimera R code. With this sample included in the test-single set, the chim_dropped.csv file is confirmed to contain seqs and weights.

dhoogest@gattaca:~/src/dada2-nf$ ./nextflow run main.nf -params-file params-ngs16s.json
...
dhoogest@gattaca:~/src/dada2-nf$ xsv table output-single/dada/624-27/counts.csv
sampleid  filtered_and_trimmed  denoised_r1  denoised_r2  merged  no_chimeras
624-27    5197                  5068         4906         4597    4235
dhoogest@gattaca:~/src/dada2-nf$ xsv table output-single/dada/624-27/chim_dropped.csv
weight  sequence
120     CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATACTTAGCTTGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTTGGGTGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAATCCTTCGATGGGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
80      CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATGCTTAGCTCGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTTGGGTGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAATCCTTCGATGGGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
49      CAGGCTTAACACATGCAAGTCGAGGGGAAACGACGGGGAAGCTTGCTTCCCCGGGCGTCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTGCCTCTGACTAAGGGATAACCCGGCGAAAGTCGGACTAATACCTTATGGCATCGTCTGCGGGCATCCAACGACGATTAAAGATTCATCGGTCAGGGATGGGGATGCGTCTGATTAGCTTGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
25      CAGGCTTAACACATGCAAGTCGAGGGGAAACGACATTGAAGCTTGCTTCGATGGTCGTCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTGCCTCTGACTGAGGGATAACCCGTCGAAAGTCGGCCTAATACCTCATGGCATCGTCTGCGGGCATCCAACGACGATTAAAGATTTCATCGGTCAGGGATGGGGATGCGTCTGATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCTACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
20      CAGGCTTAACACATGCAAGTCGAGGGGAAACGACATTGAAGCTTGCTTCGATGGTCGTCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTCGGGCGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAATCCTTCGATGGGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
19      CAGGCTTAACACATGCAAGTCGAGGGGAAACGACATTGAAGCTTGCTTCGATGGTCGTCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTGCCTCTGACTGAGGGATAACCCGTCGAAAGTCGGCCTAATACCTCATGGCATCGTCTGCGGGCATCCAACGACGATTAAAGATTTCATCGGTCAGGGATGGGGATGCGTCTGATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
12      CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATACTTAGCTTGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTCGGGCGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
11      CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATACTTAGCTTGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTGCCTCTGACTGAGGGATAACCCGTCGAAAGTCGGCCTAATACCTCATGGCATCGTCTGCGGGCATCCAACGACGATTAAAGATTTCATCGGTCAGGGATGGGGATGCGTCTGATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCTACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
10      CAGGCTTAACACATGCAAGTCGAGGGGAAACGACGGGGAAGCTTGCTTCCCCGGGCGTCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTTGGGTGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAATCCTTCGATGGGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
6       CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATACTTAGCTTGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTATCCAACCTGCCTCTGACTGAGGGATAACCCGTCGAAAGTCGGCCTAATACCTCATGGCATCGTCTGCGGGCATCCAACGACGATTAAAGATTTCATCGGTCAGGGATGGGGATGCGTCTGATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
3       CAGGCTTAACACATGCAAGTCGAGGGGAAACGACATTGAAGCTTGCTTCGATGGTCGTCGACCGGCGCACGGGTGAGTAACGCGTAAAGAACTTGCCTCTTAGACCGGGACAACATCTGGAAACGGATGCTAATACCGGATATTATGGTTTTTTCGCATGGAGGAATCATGAAAGCTAGATGCGCTAAGAGAGAGCTTTGCGTCCCATTAGCTAGTTGGTGAGGTAACGGCCCACCAAGGCAATGATGGGTAGCCGGCCTGAGAGGGTGAACGGCCACAAGGGGACT
3       CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATGCTTAGCTTGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTCGGGCGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
2       CAGGCTTAACACATGCAAGTCGAGGGGAAACGACATTGAAGCTTGCTTCGATGGTCGTCGACCGGCGCACGGGTGAGTAACGCGTAAAGAACTTGCCTCTTAGACCGGGACAACATCTGGAAACGGATGCTAATACCGGATATTATGGTTTTTTCGCATGGAGGAATCATGAAAGCTAGATGCGCTAAGAGAGAGCTTTGCGTCCCATTAGCTAGTTGGTGAGGTAACGGCCCACCAAGGCAATGATGGGTAGCCGGCCTGAGAAGGTGAACGGCCACAAGGGGACT
2       CAGGCTTAACACATGCAAGTCGTGGGGCAGCGGATGCTTAGCTCGCTAAGTATGCCGGCGACCGGCGCACGGGTGAGTAACGCGTACCGAACCTGCCCATCACACAGGGATAGGCTTGCGAAAGCAAGATTAATACCTGATGGTCTCAGTTGTATGCATGTATAATTGAGTAAAGCCTTCGGGCGGTGATGGATGGCGGTGCGTCCCATTAGGAAGTTGGCGGGGTAACGGCCCACCAAGGCGACGATCAGTAGGGGTTCTGAGAGGAAGGTCCCCCACATTGGAACT
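
As a sanity check on the output above, the weights in chim_dropped.csv sum to 362, which equals merged minus no_chimeras in counts.csv (4597 - 4235). A minimal Python sketch of that check follows. It assumes this relationship is expected to hold in general, which the MR does not state; the function name is hypothetical, and the `sv…` labels are placeholders for the real sequences.

```python
import csv
import io


def dropped_totals(counts_csv, dropped_csv):
    """Return (expected, observed) dropped-read totals.

    expected = merged - no_chimeras from the first row of counts.csv text;
    observed = sum of the weight column in chim_dropped.csv text.
    """
    counts = next(csv.DictReader(io.StringIO(counts_csv)))
    expected = int(counts["merged"]) - int(counts["no_chimeras"])
    observed = sum(
        int(row["weight"]) for row in csv.DictReader(io.StringIO(dropped_csv))
    )
    return expected, observed


# Values copied from the xsv output above (comma-separated, as in the files).
counts_csv = (
    "sampleid,filtered_and_trimmed,denoised_r1,denoised_r2,merged,no_chimeras\n"
    "624-27,5197,5068,4906,4597,4235\n"
)
weights = [120, 80, 49, 25, 20, 19, 12, 11, 10, 6, 3, 3, 2, 2]
# Placeholder labels stand in for the full sequences, which do not affect the sum.
dropped_csv = "weight,sequence\n" + "".join(
    f"{w},sv{i}\n" for i, w in enumerate(weights, start=1)
)

expected, observed = dropped_totals(counts_csv, dropped_csv)
print(expected, observed)  # 362 362
```

To run the same check against real files, the file contents could be read with `open(path).read()` and passed in unchanged.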

@nhoffman I think this is ready for review/approval now