`overall_summary.tsv` sometimes with misleading numbers in 2.9.0

agrier-wcm commented 6 months ago

Description of the bug

In 2.9.0 (but not previous versions) overall_summary.tsv, the number of denoised/merged/nochim sequences are sometimes higher than the number of filtered sequences. It is about 1/3 of the time in my current dataset. This doesn't make sense and I want to make sure filtering is still working properly.

It does appear that the input_tax_filter is always lower than the number of filtered sequences, but in some cases it is higher than the number of merged and nochim sequences, which also doesn't make sense.

I suspect there is some disconnect here. denoised, merged, and nochim should always be less than or equal to filtered, and input_tax_filter should always be equal to nochim, but neither of these things is always true with the latest version of the pipeline.

Unless there is just a difference in how these values are reported in 2.9.0? But why would input_tax_filter (and, subsequently, filtered_tax_filter) ever be higher than nochim? Regardless of how it's being reported, there should never be more final sequences (filtered_tax_filter) than there are non-chimeric sequences (nochim), right?

In any case, thank you for maintaining this fantastic tool and I apologize if I'm somehow misunderstanding the reporting in this version.

Edit to add: There might be more to this because I checked the results on the nf-core page and they don't exhibit this issue (https://nf-co.re/ampliseq/2.9.0/results/ampliseq/results-717abb8a0372c1821f5837ab3a902be90faf4cba?file=overall_summary.tsv). In my case, primers are not present in the reads. Here is the nf command I used:

nextflow run ampliseq -profile singularity --input SampleSheet.tsv --FW_primer GTGYCAGCMGCCGCGGTAA --RV_primer CCGYCAATTYMTTTRAGTTT --metadata Metadata.tsv --outdir results --email myemail@me.edu --dada_ref_taxonomy silva --dada_taxonomy_rc --ignore_empty_input_files --ignore_failed_trimming --ignore_failed_filtering --min_frequency 10 --retain_untrimmed --trunclenf 240 --trunclenr 160 --metadata_category_barplot Condition --tax_agglom_max 7 --max_memory 32.GB

Edit to add again: I just noticed that for a couple samples, the denoised/merged/nochim counts are higher than the total number of input reads (cutadapt_total_processed). So, something is definitely out of whack. Could it be that the rows of the DADA2 stats are just not sorted in the same way as the rows in the rest of the file are? That would mean it's really just an issue with how overall_summary.tsv is put together and not a more serious underlying issue.

Command used and terminal output

No response

Relevant files

No response

System information

No response

agrier-wcm commented 6 months ago

I think it is a sorting issue. The nochim values in overall_summary.tsv and in dada2/DADA2_stats.tsv both match the column sums in dada2/DADA2_table.tsv, in the order in which they occur in that table. However, the order of the columns in that table does not match the order of the samples in overall_summary.tsv and dada2/DADA2_stats.tsv.

d4straub commented 6 months ago

Thanks for the report and the additional details! Could it be related to the addition of radix sorting in https://github.com/nf-core/ampliseq/issues/706? If yes, we either have to use radix sorting everywhere or be stricter with sampleID. What where the sampleID of the samples that had wrong numbers?

agrier-wcm commented 6 months ago

Here is the order of the sample IDs in overall_summary.tsv:

blank_1_16S_Re_Seq blank_2_16S_Re_Seq blank_3_16S_Re_Seq LEFT_EMPTY_16S_Re_Seq negative_control_100_16S_Re_Seq negative_control_108_16S_Re_Seq S_190_16S_Re_Seq S_191_16S_Re_Seq S_238_16S_Re_Seq S_24_16S_Re_Seq S_250_16S_Re_Seq S_284_16S_Re_Seq S_300_16S_Re_Seq S_301_16S_Re_Seq S_302_16S_Re_Seq S_303_16S_Re_Seq S_304_16S_Re_Seq

And here is the order of the sample IDs (as columns) in dada2/DADA2_table.tsv:

LEFT_EMPTY_16S_Re_Seq S_190_16S_Re_Seq S_191_16S_Re_Seq S_238_16S_Re_Seq S_24_16S_Re_Seq S_250_16S_Re_Seq S_284_16S_Re_Seq S_300_16S_Re_Seq S_301_16S_Re_Seq S_302_16S_Re_Seq S_303_16S_Re_Seq S_304_16S_Re_Seq blank_1_16S_Re_Seq blank_2_16S_Re_Seq blank_3_16S_Re_Seq negative_control_100_16S_Re_Seq negative_control_108_16S_Re_Seq

Within overall_summary.tsv the values are correct for all columns except denoised, merged, and nochim. The values in those four columns (denoised has F and R) are sorted like the sample IDs in dada2/DADA2_table.tsv, so those values are all incorrect.

I think the radix sorting is a likely candidate. When I use default sort on this list of sample IDs, I get the ordering thats in overall_summary.tsv. When I sort with method = "radix" I get the ordering of the columns in dada2/DADA2_table.tsv. So, if we're sorting with the default method in one place and with radix in another place and then combining the results without matching row names, that would definitely cause exactly the issue we're observing here.

I haven't checked how exactly overall_summary.tsv is put together (is it just pasting columns together?), but can we merge on row names? I'm not familiar enough to say what the most straightforward and least disruptive fix would be.

d4straub commented 6 months ago

DADA2 logs are merged here by dada2_stats.nf. Furthermore, multiple sequencing runs are merged here by dada2_merge.nf. Finally, overall_summary.tsv is produced here by merging via sample names (see merge_stats.nf).

I assume dada2_stats.nf is the problem. cbind in https://github.com/nf-core/ampliseq/blob/717abb8a0372c1821f5837ab3a902be90faf4cba/modules/local/dada2_stats.nf#L51 might be problematic. That code is very old, was taken at the time from the DADA2 tutorial. Maybe the data that is merged there will need to be sorted appropriately or merged via sample IDs. Could you test/confirm whether that part is indeed the problem?

d4straub commented 5 months ago

Also seems to appear with simple sample names such as sample1 sample10 sample2 sample20 definitively a sorting problem, will need to sort that out for the next release

agrier-wcm commented 5 months ago

This seems to do the trick:

Change this:

        if ( nrow(filter_and_trim) == 1 ) {
            track <- cbind(filter_and_trim, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
        } else {
            track <- cbind(filter_and_trim, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
        }

To this:

        if ( nrow(filter_and_trim) == 1 ) {
            track <- cbind(filter_and_trim[order(rownames(filter_and_trim), method = "radix"), ], getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
        } else {
            track <- cbind(filter_and_trim[order(rownames(filter_and_trim), method = "radix"), ], sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
        }

In dada2_stats.nf.

That works for the overall_summary.tsv issue. @d4straub can you think of anything else that this sorting issue might affect? I'm not aware of anything but I also haven't really examined every detail of the output to look for mismatches. All of the qiime2 stuff should be unaffected, right?

I don't think it makes much difference, but you shouldn't need to actually change anything in the case where the if clause is true because it's only one sample, so there's nothing to sort.

d4straub commented 5 months ago

I don't think it makes much difference, but you shouldn't need to actually change anything in the case where the if clause is true because it's only one sample, so there's nothing to sort.

Indeed, that part shouldn't need the sorting, 1 entry isnt really worth ordering ;)

can you think of anything else that this sorting issue might affect?

No, I think all of the other stuff is proper merging with IDs, so there shouldnt be a problem any more. Thanks for testing that out. Once https://github.com/nf-core/ampliseq/pull/747 is merged it would be time to fix that. Would you like to do a PR?

d4straub commented 5 months ago

I opened a PR correcting the issue in https://github.com/nf-core/ampliseq/pull/750. I thought that sorting all tables before merging might be even better than sorting just one, because it might prevent any new problems when changing table sorting in future without being aware of the cbind in that module. Let me know what you think.

erikrikarddaniel commented 5 months ago

Sounds like a good suggestion, although I'm not very fond of assuming tables are sorted correctly. I always use dplyr's inner_join() etc. to join on a key.

d4straub commented 5 months ago

Yes, joining on a key would be certainly best. But the file names (that determine row names in the tables) are very inconsistent at that point as mentioned in

I also considered correcting the row names for each table and subsequently apply merge, but because row names are so divers, that seems not great. I do have the feeling correcting row names and use merge might be safer, but I couldnt find any example where it would matter, but I am open to change the implementation.

but I didnt consider changing the file names earlier to make them more uniform and then be able to merge with keys. Maybe that would be actually the best way. Will think about it.

Edit: Using more streamlined file naming seems possible but also a huge hassle with close to no benefit at that point. Column binding (as was previously) worked for years, and now all tables are sorted before column binding to ensure identical sample sequence. So for now I wont change that.

d4straub commented 5 months ago

Ok the fix in dev now, I hope it wont make any problem in the foreseeable future.

nf-core / ampliseq