nf-core / seqinspector

QC pipeline to inspect your sequences

https://nf-co.re/seqinspector

MIT License


Add modules outlined in the pipeline proposal #8

Open kedhammar opened 3 months ago

kedhammar commented 3 months ago

Functionalities and modules

Mentioned in the pipeline proposal

primaryQC_pipeline_proposal.pdf

Pipeline proposal Slack thread

Standard QC

[x] FastQC Standard QC
[ ] SeqKit histograms
- I assume this refers to seqkit watch
- Not currently available as a module, as far as I can see
[ ] SeqFu
- Not currently available as a module, as far as I can see
- [ ] fastq integrity --> seqfu check
- [ ] metadata --> seqfu metadata
- [ ] merging lanes --> seqfu merge

Duplication + Complexity

[ ] Preseq complexity
- Which subtool?
[ ] BBtools Clumpify
[ ] UMI detection (stretch goal)

Adapter and Artifact detection

[ ] Fastp
[ ] BBtools
- BBDuk
- Testformat2
- (RQCFilter2 is a corresponding subworkflow using multiple BBtools)
- (For PacBio: Removesmartbell, Icecreamfinder)

Contamination detection

[ ] FastQ screen
[ ] Sylph
[ ] Kraken2
[ ] Mapping to reference

Mentioned in the pipeline Slack channel

[ ] Mash screen
[ ] checkQC

kedhammar commented 3 months ago

#6: PR draft to start adding modules

mahesh-panchal commented 3 months ago

For WGS data intended for assembly, consider GenomeScope (https://github.com/nf-core/modules/blob/master/modules/nf-core/genomescope2/main.nf). The database is built using Meryl (also on nf-core).

But there is also a container-only version that's a little bit faster and has extra tools that might be useful (https://github.com/nf-core/modules/blob/master/modules/nf-core/genescopefk/main.nf). The databases for Merquryfk/KATGC, Merquryfk/KATCOMP, Merqury/Ploidyplot, and GeneScopefk are built using FastK.

remiolsen commented 3 months ago

Preseq complexity (which subtool?).

I've used preseq lc_extrap before and there's a module for it in nf-core (https://nf-co.re/modules/preseq_lcextrap). However, it is very prone to failing, or rather to refusing to give a complexity estimate.

Another option would be Picard (https://gatk.broadinstitute.org/hc/en-us/articles/360037591931-EstimateLibraryComplexity-Picard). I've never used it, and for the application where I worry about library complexity (HiC), the tool I use (pairtools) implements its own complexity estimate, so I have no need for it. There's no nf-core module for it as far as I can see.

kedhammar commented 3 months ago

@remiolsen any idea why preseq lc_extrap tends to refuse?

remiolsen commented 3 months ago

I'm fairly certain this is the error I used to see most commonly. To quote the preseq manual:

Q: When running lc_extrap, I receive the error
ERROR: too many iterations, poor sample

A: Most commonly this is due to the presence of defects in the approximation which cause the estimates to be unstable. Setting the step size larger (with the flag -s) will help to avoid the defects. The default step size is 1M reads or 0.05% of the input sample size rounded up to the nearest million, whichever is larger. A consequence of this action will be a reduction in the observed smoothness of the curve.

And setting the step size with the -s flag was a bit hit or miss as to whether it worked.
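
For anyone who wants to experiment with the step size, here is a minimal Python sketch of the idea: compute the default step as described in the manual excerpt, then retry lc_extrap with progressively larger -s values. Several things here are assumptions and not taken from this thread: the doubling schedule, the number of attempts, the helper names, the use of the -o output flag with BED input, and the idea that a failed run can be detected from a non-zero exit code. Treat it as a starting point, not a tested recipe.

```python
import subprocess

MILLION = 1_000_000


def default_step(n_reads):
    """Default step size per the quoted manual text: 1M reads, or 0.05% of
    the sample rounded up to the nearest million, whichever is larger."""
    # 0.05% == 5 / 10_000; integer ceiling division avoids float rounding.
    millions = -(-(n_reads * 5) // (10_000 * MILLION))
    return max(MILLION, millions * MILLION)


def step_schedule(n_reads, attempts=4, factor=2):
    """Hypothetical retry schedule: start at the default and grow each time."""
    start = default_step(n_reads)
    return [start * factor**i for i in range(attempts)]


def lc_extrap_command(bed_path, out_path, step):
    """Build the command line; the file names are placeholders."""
    return ["preseq", "lc_extrap", "-s", str(step), "-o", out_path, bed_path]


def run_with_retries(bed_path, out_path, n_reads, runner=subprocess.run):
    """Return the first step size that succeeds, or None if every attempt fails.

    Assumes a failed estimate shows up as a non-zero exit code.
    """
    for step in step_schedule(n_reads):
        cmd = lc_extrap_command(bed_path, out_path, step)
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return step
    return None
```

A larger step trades curve smoothness for stability, as the manual notes, so it probably makes sense to record which step size finally worked alongside the output.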

kedhammar commented 1 month ago

Closed https://github.com/nf-core/seqinspector/pull/6 due to being too broad and unspecific. Feel free to start new PRs addressing more specific implementations.
