Human and mouse read disambiguation for PDX samples?

apsteinberg commented 1 month ago

Description of feature

Hi,

Thank you for creating this awesome pipeline. I'm wondering if there are any modules within sarek that disambiguate mouse and human reads for PDX samples? For example, something like the disambiguate tool from AstraZeneca:

https://github.com/AstraZeneca-NGS/disambiguate

Thanks for your time and help.

Best, Asher

FriederikeHanssen commented 1 month ago

Hey!

Is this related to a similar preprocessing step as requested here: https://github.com/nf-core/sarek/issues/1144 ?

So far we have refrained from expanding the scope of sarek even further to keep the pipeline maintainable. If it is a single tool, I am slightly more inclined to have it added. What else would be necessary to make this work in the current workflow?

apsteinberg commented 1 month ago

Hi there,

This is related to the preprocessing step referenced in #1144.

Totally makes sense, I'm sure it takes a lot of time and effort to maintain. I was corresponding with @SPPearce about this on Slack (link here), and he has written a subworkflow for this: https://nf-co.re/subworkflows/fastq_align_bamcmp_bwa. It relies on three tools: (i) bwa to align to both references, (ii) bamcmp to keep reads that align to the first genome, and (iii) samtools to sort.
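
To make the idea behind step (ii) concrete, here is a minimal Python sketch of the general disambiguation logic: each read is aligned to both genomes, and it is kept only if it aligns to the graft (human) genome alone or scores better there than against the host (mouse) genome. This is a toy illustration, not bamcmp's actual implementation; the function names, the labels, and the use of a single integer alignment score per read are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def classify_read(graft_score: Optional[int], host_score: Optional[int]) -> str:
    """Label one read from its best alignment score against each genome.

    A score of None means the read did not align to that genome.
    The labels are hypothetical and chosen for this example only.
    """
    if graft_score is None and host_score is None:
        return "unaligned"
    if host_score is None:
        return "graft_only"
    if graft_score is None:
        return "host_only"
    if graft_score > host_score:
        return "graft_better"
    if graft_score < host_score:
        return "host_better"
    return "ambiguous"  # equal scores: origin cannot be decided


def keep_graft_reads(
    graft_scores: Dict[str, int], host_scores: Dict[str, int]
) -> List[str]:
    """Return names of reads assigned to the graft (e.g. human) genome."""
    names = set(graft_scores) | set(host_scores)
    kept = []
    for name in sorted(names):
        label = classify_read(graft_scores.get(name), host_scores.get(name))
        if label in ("graft_only", "graft_better"):
            kept.append(name)
    return kept


if __name__ == "__main__":
    human = {"r1": 60, "r2": 30, "r3": 50}  # toy scores vs. human reference
    mouse = {"r2": 55, "r3": 50, "r4": 40}  # toy scores vs. mouse reference
    print(keep_graft_reads(human, mouse))
```

In this sketch, ambiguous reads (equal scores) are dropped; whether to keep or discard them is a design choice that a real tool would expose as an option.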

I haven't tested it out yet, but I think that, to integrate this for PDX or other samples with contamination, this subworkflow would be run in place of the fastq_align_bwamem_mem2_dragmap_sentieon and bam_merge_index_samtools subworkflows. It could be enabled by an optional flag for these types of samples.

I would also be happy to try writing this in the next couple of months, but I am thus far a Nextflow novice :)

Thanks for your time and help!

Best, Asher

SPPearce commented 1 month ago

I do think we could do with this ability in some way, whether via bamcmp or another tool. One suggestion was a completely separate pipeline for this kind of filtering, generating BAM (or FASTQ) files that can then go into many different pipelines.

nf-core / sarek

Human and mouse read disambiguation for PDX samples? #1578
