nf-core / riboseq

Pipeline for the analysis of ribosome profiling, or Ribo-seq (also named ribosome footprinting) data.
https://nf-co.re/riboseq
MIT License
4 stars 3 forks source link

Optimise STAR command for riboseq data #30

Open pinin4fjords opened 4 months ago

pinin4fjords commented 4 months ago

Description of feature

I'm noticing a lot of multi-mapping reads and those unmapped as 'too short'. Some of this is just down to the short read lengths I guess, but we should try to add some STAR optimisations. See the STAR MultiQC plot below. The top runs are RNA-seq, bottom Riboseq.

star_alignment_plot-1

JackCurragh commented 2 months ago

I will explore this when I have time. Below are some thoughts on it. Any input would be helpful.

Looks like this param is a bit permissive. I often refer to what ORFik does and they use 10 rather than 20. This will only increase the number mapped to too many loci though.

I also wonder what the trade off is between Local and EndToEnd. Local allows softclipping which is useful if there are untemplated additions or if there was issues with the preprocessing of adapters/UMIs. But I can't see how to limit the number of softclipped bases so maybe we are better off leaving it EndToEnd.

outFilterMismatchNmax is currently default as far as I can see, which is 10. This should be max 3 for Ribo-Seq due to their short read lengths.

I am unclear how we get the 'too short' as it was my understanding that --outFilterMatchNmin controlled this and the default is 0. ORFik have a min length of 20 but it is my understanding that there are potentially valid Ribo-Seq reads as short as 18nt long. Their validity can be determined downstream of alignment based on periodicity etc. Also, maybe this could be used to control the degree of softclipping tolerated.

pinin4fjords commented 2 months ago

Could you PR any modifications you feel may be appropriate, and post the above plot to show the impact of the change with the test or test_full profiles please?

'too short' is a bit misleading with STAR, it doesn't quite mean what it seems.

iraiosub commented 1 month ago

Hey everyone, I’ll chip in case helpful. My apologies @JackCurragh if you’ve already figured these things out for your upcoming PR.

The stringent requirement of EndToEnd alignment likely fails when ends of the reads do not perfectly match the reference, thus "too short". Although --outFilterMatchNmin affects this, you can fine-tune the alignment sensitivity by adjusting several other parameters that control its behaviour: • Reducing --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to accept shorter or less perfect alignments. • Increasing --outFilterMismatchNmax to accommodate more mismatches.

However, these alterations might affect the distribution of reads in other categories as well - i.e. if we allow too many mismatches we might get many more multi-mappers.

I recommend checking the multiQC/STAR log output alongside the BAMs’ riboseq QC metrics to fully assess the impact of any parameter adjustments. I have also noticed high variability between datasets in relation to the STAR settings used, so I do wonder before changing the current settings to look how they behave on other datasets besides the test_full one before fully making the switch? I can help with some of these matters, feel free to message me.

JackCurragh commented 1 month ago

Hi Ira, thanks for the input. I have this ready to PR with the updated params but I agree that it is worth further investigation. Unfortunately I probably won't have time to do this until later this week but will be sure to report back when we have results to discuss.

pinin4fjords commented 1 month ago

I messed around quite a bit (see e.g. https://github.com/nf-core/riboseq/pull/37) and didn't manage to make any improvements to the mapping statistics, so didn't proceed further.

Of course the parameter space is large, so I'm sure there are improvements to be made, we just need to demonstrate the utility.