stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.
MIT License

Add a read trimming step to alignment workflows #179

Open a-frantz opened 2 months ago

a-frantz commented 2 months ago

Currently, our workflows assume read trimming has already occurred upstream, so we don't perform it as part of alignment. This assumption is often violated.

As part of this issue, we need to select a read trimming tool/algorithm (might require some comparative analysis) and then incorporate it into the *-core workflows. We also need to ensure there's no harm in read trimming FASTQs that have already been read trimmed.

If we opt to investigate multiple read trimming tools, we might as well write WDL tasks for all of them. It could be nice if users could select the trimming tool as part of the workflow; however, we may find that the tools are not all created equal and that only one choice should be supported. TBD.

mjgattas commented 5 days ago

I'd love to take this one on! From our conversation, it sounds like trimmomatic is the tool you've started investigating, but should I continue the comparative analysis?

a-frantz commented 5 days ago

First step is just going to be getting a working WDL implementation of trimmomatic. We can discuss next steps after that's complete.

I've only skimmed the documentation, but it looks like trimmomatic has two modes: a Single-End (SE) mode and a Paired-End (PE) mode. So we are going to want each of those as its own WDL task. Dive into the documentation and expose as many of the parameters as you can. Make sure to copy and paste (with possible editorializing) any relevant bits of the documentation into the WDL meta sections. Our goal in terms of documentation is to provide an equivalent, if not enhanced, experience compared to reading the original docs. Check out the other task files to see how our documentation conventions work and do your best to copy them.
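To make the difference between the two modes concrete, here is a minimal Python sketch of how the command line differs between SE and PE. It is only an illustration, not the WDL task itself: the function name `trimmomatic_command`, the default jar path, and the file names in the usage note are hypothetical. The argument order (mode, options, inputs, outputs, trimming steps) reflects my understanding of the trimmomatic command line and should be checked against the manual before it is relied on.

```python
def trimmomatic_command(mode, inputs, outputs, steps, jar="trimmomatic.jar", threads=1):
    """Assemble an argument list for a trimmomatic run (hypothetical helper).

    SE mode takes 1 input and 1 output.
    PE mode takes 2 inputs and 4 outputs, assumed to be ordered as:
    forward paired, forward unpaired, reverse paired, reverse unpaired.
    """
    expected_counts = {"SE": (1, 1), "PE": (2, 4)}
    if mode not in expected_counts:
        raise ValueError(f"mode must be 'SE' or 'PE', got {mode!r}")

    n_inputs, n_outputs = expected_counts[mode]
    if len(inputs) != n_inputs or len(outputs) != n_outputs:
        raise ValueError(
            f"{mode} mode expects {n_inputs} input(s) and {n_outputs} output(s)"
        )

    # The jar path is an assumption; adjust it to wherever trimmomatic is installed.
    return [
        "java", "-jar", jar, mode,
        "-threads", str(threads),
        *inputs,
        *outputs,
        *steps,  # trimming steps, e.g. "SLIDINGWINDOW:4:15", "MINLEN:36"
    ]
```

For example, `trimmomatic_command("SE", ["reads.fastq.gz"], ["reads.trimmed.fastq.gz"], ["MINLEN:36"])` returns a list that could be passed to `subprocess.run`. The differing input and output counts are the main reason the two modes map naturally onto two separate WDL tasks.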

I recommend installing the sprocket VSCode extension and using that for writing this. Enable lints and follow any directions from sprocket (except for ContainerValue and TrailingComma, which we are currently ignoring in this repo).

Then grab some FASTQ files and start testing! Run your tasks using miniwdl (short guide here).

Lastly, add some test coverage under the tests/ directory (should be clear how to do that from the existing tests).

Once all the above is looking good, you can ping me and @adthrasher to review the PR.

It would also be great if you could answer this question for us:

"We also need to ensure there's no harm in read trimming FASTQs that have already been read trimmed."

For this, just run the trimmed output back through the trimming task as input and check for differences between the first-pass and second-pass results. We hope there won't be any, but that needs to be investigated!
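One way to do that comparison is sketched below in Python. It assumes standard four-line FASTQ records and handles plain or gzipped files. The function names `read_fastq` and `diff_fastq` and the file names in the usage note are hypothetical, not part of this repo.

```python
import gzip
from itertools import zip_longest


def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a plain or gzipped FASTQ file.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def diff_fastq(first_pass, second_pass):
    """Return (record_index, first_record, second_record) for every mismatch.

    A record that exists in only one file is paired with None.
    """
    differences = []
    pairs = zip_longest(read_fastq(first_pass), read_fastq(second_pass))
    for index, (first, second) in enumerate(pairs):
        if first != second:
            differences.append((index, first, second))
    return differences
```

Calling `diff_fastq("trimmed_once.fastq.gz", "trimmed_twice.fastq.gz")` and getting an empty list back would mean the second trimming pass changed nothing. Note that this compares records by position, so a read dropped in the second pass will make every later record show up as a mismatch. Comparing by read name would be more robust if reads are expected to be dropped.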