agrier-wcm closed this issue 5 months ago.
I think it is a sorting issue. The `nochim` values in `overall_summary.tsv` and in `dada2/DADA2_stats.tsv` both match the column sums in `dada2/DADA2_table.tsv`, in the order in which they occur in that table. However, the order of the columns in that table does not match the order of the samples in `overall_summary.tsv` and `dada2/DADA2_stats.tsv`.
Thanks for the report and the additional details! Could it be related to the addition of radix sorting in https://github.com/nf-core/ampliseq/issues/706? If yes, we either have to use radix sorting everywhere or be stricter with `sampleID`. What were the `sampleID`s of the samples that had wrong numbers?
Here is the order of the sample IDs in `overall_summary.tsv`:
blank_1_16S_Re_Seq blank_2_16S_Re_Seq blank_3_16S_Re_Seq LEFT_EMPTY_16S_Re_Seq negative_control_100_16S_Re_Seq negative_control_108_16S_Re_Seq S_190_16S_Re_Seq S_191_16S_Re_Seq S_238_16S_Re_Seq S_24_16S_Re_Seq S_250_16S_Re_Seq S_284_16S_Re_Seq S_300_16S_Re_Seq S_301_16S_Re_Seq S_302_16S_Re_Seq S_303_16S_Re_Seq S_304_16S_Re_Seq
And here is the order of the sample IDs (as columns) in `dada2/DADA2_table.tsv`:
LEFT_EMPTY_16S_Re_Seq S_190_16S_Re_Seq S_191_16S_Re_Seq S_238_16S_Re_Seq S_24_16S_Re_Seq S_250_16S_Re_Seq S_284_16S_Re_Seq S_300_16S_Re_Seq S_301_16S_Re_Seq S_302_16S_Re_Seq S_303_16S_Re_Seq S_304_16S_Re_Seq blank_1_16S_Re_Seq blank_2_16S_Re_Seq blank_3_16S_Re_Seq negative_control_100_16S_Re_Seq negative_control_108_16S_Re_Seq
Within `overall_summary.tsv` the values are correct for all columns except `denoised`, `merged`, and `nochim`. The values in those four columns (`denoised` has F and R) are sorted like the sample IDs in `dada2/DADA2_table.tsv`, so those values are all incorrect.
I think the radix sorting is a likely candidate. When I use the default sort on this list of sample IDs, I get the ordering that's in `overall_summary.tsv`. When I sort with `method = "radix"`, I get the ordering of the columns in `dada2/DADA2_table.tsv`. So, if we're sorting with the default method in one place and with radix in another, and then combining the results without matching row names, that would cause exactly the issue we're observing here.
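The two orderings are easy to reproduce. A minimal Python sketch (four of the sample IDs from the lists above): plain codepoint comparison behaves like R's `order(..., method = "radix")` in the C locale, placing uppercase before lowercase, while a case-insensitive key roughly mimics R's default locale-collated sort.

```python
# Four of the sample IDs from this report.
ids = [
    "blank_1_16S_Re_Seq",
    "LEFT_EMPTY_16S_Re_Seq",
    "S_190_16S_Re_Seq",
    "negative_control_100_16S_Re_Seq",
]

# Codepoint order, like R's radix sort: uppercase L/S before lowercase b/n,
# matching the DADA2_table.tsv column order.
radix_like = sorted(ids)

# Rough stand-in for R's default locale-aware sort (case-insensitive),
# which matches the overall_summary.tsv ordering.
locale_like = sorted(ids, key=str.casefold)
```

The same four IDs thus come out in two different orders depending only on the collation, which is exactly the mismatch seen between the two files.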
I haven't checked how exactly `overall_summary.tsv` is put together (is it just pasting columns together?), but can we merge on row names? I'm not familiar enough to say what the most straightforward and least disruptive fix would be.
DADA2 logs are merged here by dada2_stats.nf. Furthermore, multiple sequencing runs are merged here by dada2_merge.nf. Finally, `overall_summary.tsv` is produced here by merging via sample names (see merge_stats.nf).
I assume dada2_stats.nf is the problem. The `cbind` in https://github.com/nf-core/ampliseq/blob/717abb8a0372c1821f5837ab3a902be90faf4cba/modules/local/dada2_stats.nf#L51 might be problematic. That code is very old; it was taken at the time from the DADA2 tutorial. Maybe the data that is merged there will need to be sorted appropriately or merged via sample IDs. Could you test/confirm whether that part is indeed the problem?
This also seems to appear with simple sample names such as sample1 sample10 sample2 sample20, so it is definitely a sorting problem; we will need to sort that out for the next release.
This seems to do the trick:
Change this:
if ( nrow(filter_and_trim) == 1 ) {
track <- cbind(filter_and_trim, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
} else {
track <- cbind(filter_and_trim, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
}
To this:
if ( nrow(filter_and_trim) == 1 ) {
track <- cbind(filter_and_trim[order(rownames(filter_and_trim), method = "radix"), ], getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
} else {
track <- cbind(filter_and_trim[order(rownames(filter_and_trim), method = "radix"), ], sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
}
In dada2_stats.nf.
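To see why the original `cbind` misbehaves, here is a small Python sketch with made-up counts (not the pipeline's actual numbers): binding columns by position pairs the i-th row of one table with the i-th row of the other, so two tables in different orders get silently mispaired, whereas sorting both sides the same way first keeps each sample's counts together.

```python
# Hypothetical per-sample counts, keyed by sample ID.
filtered = {"blank_1": 50, "LEFT_EMPTY": 10, "S_190": 900}   # locale-style order
denoised = {"LEFT_EMPTY": 9, "S_190": 880, "blank_1": 48}    # radix-style order

# Positional binding, as cbind does: pairs by index, not by ID, so
# blank_1's filtered count lands next to LEFT_EMPTY's denoised count.
mispaired = list(zip(filtered.items(), denoised.items()))

# Fix: put both tables in the same order before binding by position
# (codepoint sort here, analogous to order(..., method = "radix")).
aligned = [(sid, filtered[sid], denoised[sid]) for sid in sorted(filtered)]
```

After the sort, every tuple holds one sample's own counts, which is what the patched `dada2_stats.nf` achieves by ordering `filter_and_trim` before the `cbind`.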
That works for the `overall_summary.tsv` issue. @d4straub, can you think of anything else that this sorting issue might affect? I'm not aware of anything, but I also haven't really examined every detail of the output to look for mismatches. All of the qiime2 stuff should be unaffected, right?
I don't think it makes much difference, but you shouldn't need to actually change anything in the case where the `if` clause is true, because it's only one sample, so there's nothing to sort.
> I don't think it makes much difference, but you shouldn't need to actually change anything in the case where the if clause is true because it's only one sample, so there's nothing to sort.
Indeed, that part shouldn't need the sorting; 1 entry isn't really worth ordering ;)
> can you think of anything else that this sorting issue might affect?
No, I think all of the other stuff is proper merging with IDs, so there shouldn't be a problem any more. Thanks for testing that out. Once https://github.com/nf-core/ampliseq/pull/747 is merged, it would be time to fix that. Would you like to do a PR?
I opened a PR correcting the issue in https://github.com/nf-core/ampliseq/pull/750. I thought that sorting all tables before merging might be even better than sorting just one, because it might prevent new problems if table sorting changes in the future without anyone being aware of the `cbind` in that module. Let me know what you think.
Sounds like a good suggestion, although I'm not very fond of assuming tables are sorted correctly. I always use dplyr's `inner_join()` etc. to join on a key.
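For comparison, the key-based approach in a minimal Python sketch (made-up counts; the real module works on R data frames): an inner join on the sample ID makes the result independent of row order in either table.

```python
# Hypothetical per-sample counts in arbitrary, mismatched orders.
filtered = {"S_24": 1000, "blank_1": 50}
denoised = {"blank_1": 48, "S_24": 990}

# Inner join on the sample-ID key: only IDs present in both tables are kept,
# and each row's values are guaranteed to belong to the same sample.
joined = {sid: (filtered[sid], denoised[sid])
          for sid in filtered.keys() & denoised.keys()}
```

No assumption about ordering is needed, which is why a keyed join is the more robust design when consistent row names are available.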
Yes, joining on a key would certainly be best. But the file names (which determine the row names in the tables) are very inconsistent at that point, as mentioned in:
> I also considered correcting the row names for each table and subsequently applying merge, but because the row names are so diverse, that seems not great. I do have the feeling that correcting the row names and using merge might be safer, but I couldn't find any example where it would matter; still, I am open to changing the implementation.
But I didn't consider changing the file names earlier to make them more uniform and then being able to merge with keys. Maybe that would actually be the best way. Will think about it.
Edit: Using more streamlined file naming seems possible but also a huge hassle with close to no benefit at this point. Column binding (as was done previously) worked for years, and now all tables are sorted before column binding to ensure an identical sample sequence. So for now I won't change that.
Ok, the fix is in dev now; I hope it won't cause any problems in the foreseeable future.
Description of the bug
In 2.9.0 (but not previous versions), in `overall_summary.tsv` the number of `denoised`/`merged`/`nochim` sequences is sometimes higher than the number of `filtered` sequences. It happens about 1/3 of the time in my current dataset. This doesn't make sense, and I want to make sure filtering is still working properly.

It does appear that `input_tax_filter` is always lower than the number of `filtered` sequences, but in some cases it is higher than the number of `merged` and `nochim` sequences, which also doesn't make sense. I suspect there is some disconnect here.
`denoised`, `merged`, and `nochim` should always be less than or equal to `filtered`, and `input_tax_filter` should always be equal to `nochim`, but neither of these things is always true with the latest version of the pipeline.

Unless there is just a difference in how these values are reported in 2.9.0? But why would `input_tax_filter` (and, subsequently, `filtered_tax_filter`) ever be higher than `nochim`? Regardless of how it's being reported, there should never be more final sequences (`filtered_tax_filter`) than there are non-chimeric sequences (`nochim`), right?

In any case, thank you for maintaining this fantastic tool, and I apologize if I'm somehow misunderstanding the reporting in this version.
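Those invariants are straightforward to check programmatically. A hedged Python sketch (the `check_row` helper and the simplified column set are illustrative, not part of the pipeline):

```python
def check_row(row):
    """Return the per-sample invariants that a row of counts violates."""
    problems = []
    # Reads can only be lost, never gained, after the filtering step.
    for col in ("denoised", "merged", "nochim"):
        if row[col] > row["filtered"]:
            problems.append(f"{col} > filtered")
    # Taxonomic filtering starts from the non-chimeric sequences.
    if row["input_tax_filter"] != row["nochim"]:
        problems.append("input_tax_filter != nochim")
    return problems

# A row exhibiting the reported symptom: merged/nochim exceed filtered.
bad = {"filtered": 100, "denoised": 98, "merged": 120,
       "nochim": 119, "input_tax_filter": 119}
print(check_row(bad))  # -> ['merged > filtered', 'nochim > filtered']
```

Running such a check over every row of `overall_summary.tsv` would flag exactly the samples whose counts were scrambled.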
Edit to add: There might be more to this because I checked the results on the nf-core page and they don't exhibit this issue (https://nf-co.re/ampliseq/2.9.0/results/ampliseq/results-717abb8a0372c1821f5837ab3a902be90faf4cba?file=overall_summary.tsv). In my case, primers are not present in the reads. Here is the nf command I used:
nextflow run ampliseq -profile singularity --input SampleSheet.tsv --FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer CCGYCAATTYMTTTRAGTTT --metadata Metadata.tsv --outdir results --email myemail@me.edu --dada_ref_taxonomy silva --dada_taxonomy_rc --ignore_empty_input_files --ignore_failed_trimming --ignore_failed_filtering --min_frequency 10 --retain_untrimmed --trunclenf 240 --trunclenr 160 --metadata_category_barplot Condition --tax_agglom_max 7 --max_memory 32.GB
Edit to add again: I just noticed that for a couple of samples, the `denoised`/`merged`/`nochim` counts are higher than the total number of input reads (`cutadapt_total_processed`). So, something is definitely out of whack. Could it be that the rows of the DADA2 stats are just not sorted in the same way as the rows in the rest of the file? That would mean it's really just an issue with how `overall_summary.tsv` is put together and not a more serious underlying issue.

Command used and terminal output
No response
Relevant files
No response
System information
No response