nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Option to filter low-abundant ASVs before running QIIME2_ANCOM_ASV #571

Closed SergeWielhouwer closed 1 year ago

SergeWielhouwer commented 1 year ago

Description of feature

Hi,

I am currently running ampliseq 2.5.0 on a PE 16S V4 data set (16 samples, 200K reads each) using the following command:

```bash
/mnt/shared/development/16Smicrobiome/nextflow_23.04.0/nextflow run /mnt/shared/development/16Smicrobiome/ampliseq-2.5.0/main.nf \
    -profile singularity \
    --input samplesheet.tsv \
    --FW_primer TATGGTAATTGTGTGCCAGCMGCCGCGGTA \
    --RV_primer GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC \
    --metadata metadata_t0_vs_t4.tsv \
    --retain_untrimmed \
    --outdir results_t0_vs_t4 \
    --illumina_novaseq
```

It has been busy at the QIIME2_ANCOM_ASV step for over 7 days now; does anyone know whether this is normal behaviour? I already subsetted to 200K reads to reduce analysis time, as the data was a bit oversequenced for this type of application. In the QIIME2 feature-table.tsv file I see 55984 different OTU IDs; I guess this is explained by high taxonomic diversity in my data set, but is this normal?

Is it possible to pre-filter the input of QIIME2_ANCOM_ASV to reduce computational time? I expected the analysis step to finish within 2 days at max.

My apologies if this is not the right place to ask for help.

Best regards,

Serge

d4straub commented 1 year ago

Hi,

That's the right place to ask for help; another option is the nf-core Slack, see https://nf-co.re/join.

It is currently still busy at the QIIME2_ANCOM_ASV step for over 7 days now, does anyone know whether this is normal behaviour?

60k ASVs is indeed quite a lot. ANCOM at the ASV level does take a long time, and with this many ASVs it can take several days.

I already subsetted to the 200K reads to reduce analysis time as it was a bit oversequenced for this type of application.

Yes, even 200k might be oversequenced, but I would be fine with this; I would not subset the original reads unless absolutely necessary. I'd rather make quality filtering stricter, which is also expected to reduce false-positive ASVs. There are several ways to be stricter with the data, e.g.:

- not using `--retain_untrimmed`
- using a higher value for https://nf-co.re/ampliseq/2.5.0/parameters#trunc_qmin (I typically use 35)
- reducing https://nf-co.re/ampliseq/2.5.0/parameters#max_ee (e.g. from the default 2 to 1)
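As a concrete sketch, the original command could be tightened along those lines. This is illustrative only: the output directory name is made up, and the exact `--trunc_qmin` / `--max_ee` values are the suggestions above, not verified settings for this data set.

```shell
# Stricter-QC variant of the original ampliseq 2.5.0 run (illustrative values):
#  - --retain_untrimmed dropped, so reads without a primer match are discarded
#  - --trunc_qmin 35 makes the automatic read truncation stricter
#  - --max_ee 1 rejects reads with more than 1 expected error in DADA2
nextflow run nf-core/ampliseq -r 2.5.0 \
    -profile singularity \
    --input samplesheet.tsv \
    --FW_primer TATGGTAATTGTGTGCCAGCMGCCGCGGTA \
    --RV_primer GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC \
    --metadata metadata_t0_vs_t4.tsv \
    --outdir results_t0_vs_t4_strict \
    --illumina_novaseq \
    --trunc_qmin 35 \
    --max_ee 1
```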

For the qiime2 feature-table.tsv file I see 55984 different OTU ID's, I guess this is explained by a high taxonomic diversity in my data set, and is this normal?

I never had so many ASVs for 16 samples. Yes, it might be due to diversity, perhaps within but possibly also between samples. For environmental samples I typically have around 10-20k ASVs at most.

Is it possible to pre-filter the input of QIIME2_ANCOM_ASV to reduce computational time? I expected the analysis step to finish within 2 days at max.

Not specifically for QIIME2_ANCOM_ASV, but after DADA2 you can filter the ASVs for all downstream analysis steps with the parameters listed in https://nf-co.re/ampliseq/2.5.0/parameters#asv-filtering. All of those could be helpful, but https://nf-co.re/ampliseq/2.5.0/parameters#min_samples might have the highest impact, as it removes ASVs that occur in only a few samples. I'd set it to the number of replicates, if you have any.
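The effect of a prevalence filter like `min_samples` can be sketched on a toy count table. This is a plain-Python illustration of the idea, not ampliseq or QIIME2 code; the table and counts are made up.

```python
# Toy ASV count table: each key is an ASV, each list holds its counts
# across 4 samples. An ASV is kept only if it has non-zero counts in
# at least `min_samples` samples (the idea behind ampliseq's --min_samples).
table = {
    "ASV_1": [120, 98, 135, 110],  # present in all 4 samples -> kept
    "ASV_2": [0, 0, 3, 0],         # present in 1 sample only -> removed
    "ASV_3": [15, 0, 22, 9],       # present in 3 samples     -> kept
}

def filter_min_samples(table, min_samples):
    """Keep ASVs observed (count > 0) in at least `min_samples` samples."""
    return {
        asv: counts
        for asv, counts in table.items()
        if sum(c > 0 for c in counts) >= min_samples
    }

filtered = filter_min_samples(table, min_samples=2)
print(sorted(filtered))  # ['ASV_1', 'ASV_3']: ASV_2 is dropped
```

Rare, low-prevalence ASVs contribute disproportionately to ANCOM's runtime, so a filter like this shrinks the input without touching the abundant features.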

SergeWielhouwer commented 1 year ago

Thank you, Daniel! The minimal Phred score and min_samples per ASV look really promising. I will definitely consider using these for future runs :).

SergeWielhouwer commented 1 year ago

Small update: QIIME2_ANCOM_ASV is still running after 32 days, after I extended the allocated job time limit. I guess aggressive pre-processing on top of subsampling may be required in some instances 😬.

d4straub commented 1 year ago

Yes, so it appears; that seems unbearable! Are the solutions mentioned above not enough? Should there be a specific filter only before ANCOM? I honestly think that for large datasets ANCOM should probably be run manually, or even better, ANCOM-BC. I still think that with 16 samples you should not have so many ASVs, and you need to be more careful with QC.
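Running ANCOM manually on a pre-filtered table could look roughly like this with the QIIME2 CLI (a sketch only; file names, filter thresholds, and the metadata column are hypothetical, and the artifacts would come from the pipeline's QIIME2 output):

```shell
# Drop rare ASVs first to keep ANCOM's runtime manageable
# (thresholds are illustrative)
qiime feature-table filter-features \
    --i-table table.qza \
    --p-min-samples 2 \
    --p-min-frequency 10 \
    --o-filtered-table table-filtered.qza

# ANCOM cannot handle zeros, so add a pseudocount
qiime composition add-pseudocount \
    --i-table table-filtered.qza \
    --o-composition-table comp-table.qza

# Run ANCOM against a metadata column of interest
qiime composition ancom \
    --i-table comp-table.qza \
    --m-metadata-file metadata_t0_vs_t4.tsv \
    --m-metadata-column treatment \
    --o-visualization ancom-treatment.qzv
```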

SergeWielhouwer commented 1 year ago

I haven't tried the solutions yet, as I kept hoping it would finish eventually; mostly I did not want to filter the dataset further. Sadly, it never finished. I will definitely try out the filters next to avoid endless waiting :). I think the dataset I used was flawed anyway, as no differentially abundant taxa were found.

d4straub commented 1 year ago

I'll close this now; the runtime seems to have come from the overly abundant ASVs. Please feel free to open another issue if there are any other problems.