ksahlin / ultra

Long-read splice alignment with high accuracy
60 stars 10 forks source link

How to control minimap 2 parameters during uLTRA alignment #6

Closed pre-mRNA closed 2 years ago

pre-mRNA commented 3 years ago

Hi,

I understand uLTRA relies on mm2 when aligning to unannotated regions. How can I control minimap2 mapping parameters?

Cheers,

ksahlin commented 3 years ago

Hi,

There are unfortunately no parameters to do this. I can add it to the next version. You can however do it "manually".

  1. Locate the file prefilter_genomic_reads.py For example, do the following:
(ultra) [kris@rackham3 ~]$ which uLTRA 
/proj/snic2020-15-201/anaconda3/envs/ultra/bin/uLTRA

Then I have file prefilter_genomic_reads.py in

/proj/snic2020-15-201/anaconda3/envs/ultra/lib/python3.8/site-packages/modules/prefilter_genomic_reads.py
  1. Change line 32 in prefilter_genomic_reads.py

You can manually edit the prefilter_genomic_reads.py file on line 32 (the call to minimap2).

For instance, if you want to set -w to 5 and k to 13, you can do:

subprocess.check_call([ 'minimap2',  '-ax', 'splice', '--eqx' , '-k', '13, '-w', '5',  '-t', str(nr_cores), refs_path, read_path], stdout=output_file, stderr=stderr_file)
pre-mRNA commented 3 years ago

Thanks for your quick response.

It would be great to be able to control the minimap2 flags in a future release of uLTRA.

I am a big fan of uLTRA by the way. We have done extensive benchmarking for aligning direct RNA reads and uLTRA outperforms minimap2 in all contexts we have tested. Cheers.

ksahlin commented 3 years ago

Ok, I will do that.

I'm really glad to hear that! Especially since we did not include dRNA sequencing in our evaluations.

pre-mRNA commented 3 years ago

Thanks, I had one last question. It says that uLTRA compares mm2 and uLTRA CIGAR strings to decide on the 'best' alignment for a read.

I am wondering what parameters are used in this comparison? Does the algorithm filter for the simplest CIGAR? Or the most aligned bases?

ksahlin commented 3 years ago

Good question. The computation is currently simple:

Let ultra_matches and mm2_matches be all the matching nucleotides in the alignment of uLTRA and minimap2, respectively. Similarly, let ultra_diff and mm2_diff be the sum of all substitutions, deletions, and insertions in the alignment of uLTRA and minimap2, respectively.

The, the computation is as follows: S = (ultra_matches - ultra_diff) - (mm2_matches - mm2_diff). uLTRA chooses the minimap2 alignment if S is negative.

I should mention that softclips (at either 5' or 3' ends) are disregarded in the above computations. I observed that minimap2 is very good at softclipping at the right positions and may therefore get a better score according to the computation above, even though they are aligned to identical splice sites (uLTRA extends the alignment out fully in ends). The reason I mention this is that uLTRA prints a table at the end of the alignment stage that states how many alignments were preferred for either aligner. This table is not accurate for low score differences (less than 5-10 score difference) because of the above reason. There is certainly room to explore alternative decision-making on the final alignment.

pre-mRNA commented 2 years ago

Interesting, especially the last part about softclipping at read-ends.

When using direct RNA reads, one would expect that there is no need to softclip at the 3' end of the read (except for sometimes a few nt, representing sequencing/basecalling errors) since this represents the bona fide end of the RNA molecule.

Cheers!