stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.
MIT License
33 stars 10 forks source link

feat: opt-in to run kraken twice, second time on only mapped reads #133

Closed a-frantz closed 6 months ago

a-frantz commented 7 months ago

This might be a dud of an analysis. I've only run it on one sample, a "Cellular Organism" case from the latest RTCG batch, and this extra Kraken run didn't reveal anything interesting. multiqc_report.html.txt

Do we want to merge this now? It works as expected. The analysis is sound in theory. Remains to be seen if it's useful in practice. There's been a test with n of 1 which doesn't tell us much of anything. I vote merge now, with the possibility we revert the commit sometime down the line when we've used it more. I want to see these result before and after XenoCP. That might be neat. Probably some other use cases we will run into in the future(?)

Voting against leaving this PR open, because I want to use it and I don't want to deal with keeping it up to date with main, which has other work going on. If main were stagnant I'd feel differently.

a-frantz commented 7 months ago

@adthrasher I was about to hit merge when I got an idea.

Should the F = "0x904" line be at the workflow input level? (With a different param name) Perhaps all the samtools fastq filters should be brought up to the workflow level? (I think we'd want to obfuscate them in a struct, but I'm still in the "spit-ball" idea phase here.)

My thinking: there are a few tangentially related questions you can ask by filtering-pre-Kraken. I don't want to support them all individually, but we have a mechanism to abstract here and allow the user to construct their own comparative filter.

This PR currently allows you to ask "what species do the mapped reads best map to?" I can certainly see someone wanting to know "what species do the unmapped reads best map to?" A little more niche question, but one I see a use case for.

I haven't worked through all the details yet, but LMK if you think this is worth pursuing. A difficulty here will be sensibly constructing the docs. But we can get through that.

adthrasher commented 7 months ago

I think having a struct to wrap the flag filters is worth pursuing. Right now, you've documented in comments, but if you had the right struct, you wouldn't need the comments. It would also be potentially useful in other filtering.

a-frantz commented 7 months ago

@adthrasher Please pay special attention to data_structures/flag_filter.wdl. Of course we don't need to feel tied to any precedent set here, but it would be nice to get the whole "data_structures structure" right the first time. So let's dot our 'i's and all that to make sure we're happy with this direction.