Optimise STAR command for riboseq data

pinin4fjords commented 4 months ago

Description of feature

I'm noticing a lot of multi-mapping reads and those unmapped as 'too short'. Some of this is just down to the short read lengths I guess, but we should try to add some STAR optimisations. See the STAR MultiQC plot below. The top runs are RNA-seq, bottom Riboseq.

star_alignment_plot-1

JackCurragh commented 2 months ago

I will explore this when I have time. Below are some thoughts on it. Any input would be helpful.

Looks like this param is a bit permissive. I often refer to what ORFik does and they use 10 rather than 20. This will only increase the number mapped to too many loci though.

I also wonder what the trade off is between Local and EndToEnd. Local allows softclipping which is useful if there are untemplated additions or if there was issues with the preprocessing of adapters/UMIs. But I can't see how to limit the number of softclipped bases so maybe we are better off leaving it EndToEnd.

outFilterMismatchNmax is currently default as far as I can see, which is 10. This should be max 3 for Ribo-Seq due to their short read lengths.

I am unclear how we get the 'too short' as it was my understanding that --outFilterMatchNmin controlled this and the default is 0. ORFik have a min length of 20 but it is my understanding that there are potentially valid Ribo-Seq reads as short as 18nt long. Their validity can be determined downstream of alignment based on periodicity etc. Also, maybe this could be used to control the degree of softclipping tolerated.

pinin4fjords commented 2 months ago

Could you PR any modifications you feel may be appropriate, and post the above plot to show the impact of the change with the test or test_full profiles please?

'too short' is a bit misleading with STAR, it doesn't quite mean what it seems.

iraiosub commented 1 month ago

Hey everyone, I’ll chip in case helpful. My apologies @JackCurragh if you’ve already figured these things out for your upcoming PR.

I agree w reducing the number of multi-mappers.
The most likely cause for the high fraction of “too short” reads is the --alignEndsType EndToEnd. I however, think this is the most appropriate default setting for ribo-seq data given their short length, and the fact that the read length encodes important positional information related to ribosomal decoding (unlike RNA-seq reads), and the degree of soft-clipping is hard to control. This precision is important for accurately determining the position of ribosomes on mRNAs. I would trade-off specificity over sensitivity, but that depends on what downstream analyses the data will be used for.

The stringent requirement of EndToEnd alignment likely fails when ends of the reads do not perfectly match the reference, thus "too short". Although --outFilterMatchNmin affects this, you can fine-tune the alignment sensitivity by adjusting several other parameters that control its behaviour: • Reducing --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to accept shorter or less perfect alignments. • Increasing --outFilterMismatchNmax to accommodate more mismatches.

However, these alterations might affect the distribution of reads in other categories as well - i.e. if we allow too many mismatches we might get many more multi-mappers.

I recommend checking the multiQC/STAR log output alongside the BAMs’ riboseq QC metrics to fully assess the impact of any parameter adjustments. I have also noticed high variability between datasets in relation to the STAR settings used, so I do wonder before changing the current settings to look how they behave on other datasets besides the test_full one before fully making the switch? I can help with some of these matters, feel free to message me.

JackCurragh commented 1 month ago

Hi Ira, thanks for the input. I have this ready to PR with the updated params but I agree that it is worth further investigation. Unfortunately I probably won't have time to do this until later this week but will be sure to report back when we have results to discuss.

pinin4fjords commented 1 month ago

I messed around quite a bit (see e.g. https://github.com/nf-core/riboseq/pull/37) and didn't manage to make any improvements to the mapping statistics, so didn't proceed further.

Of course the parameter space is large, so I'm sure there are improvements to be made, we just need to demonstrate the utility.

nf-core / riboseq

Optimise STAR command for riboseq data #30

Description of feature