ksahlin / isONclust

De novo clustering of long transcript reads into genes
GNU General Public License v3.0
48 stars 8 forks source link

homopolymer compression? #19

Open nextgenusfs opened 2 years ago

nextgenusfs commented 2 years ago

Thanks @ksahlin for the great tool. I've been trying to use for a variety of clustering tasks using some amplicon data. I still have some uncertainty on how it is performing as I don't necessarily know exactly I should be expecting in my amplicon pool, but I do know that I see some "lumping" of closely related sequences that I would like to see come out in separate clusters.

So we are running all of our amplicons with the newer R10 flow cells and use high accuracy basecalling model -- the goal is to reduce errors in homopolymers. I notice this is different than at least your test case (using fast basecalling model). So as I looked at the code I see that the first step in clustering is actually to compress homopolymers: https://github.com/ksahlin/isONclust/blob/master/modules/cluster.py#L209. So I'm wondering what would happen if homopolymers were not compressed? Is compressing homopolymers required here just for kmer selection or is it being used because of higher error rates in these sequences?

Related point is that the error profiles are quite different between the ONT guppy basecall models, so I'm also then wondering about the role of the empirical distances that seem to be hard coded into isONclust -- would the basecall model (ie single read accuracy) alter these distances? I guess in other words should I expect differing performance using the FAST vs HAC basecall models? And if I try to remove the homopolymer compression step as a test, would that cause a problem?

ksahlin commented 2 years ago

Hi @nextgenusfs,

Q1: Homopolymer compression mainly helps for transcriptome data as it removes polyA tails.

I'm fairly sure you can simply change one line (274) here from

seq_hpol_comp = ''.join(ch for ch, _ in itertools.groupby(seq))

to

seq_hpol_comp = seq

and isONclust should still work but without homopolymer compression.

Q2: The error probabilities are precomputed and hardcoded. They are agnostic to the model the basecaller uses. As long as the base caller gives roughly consistent Phred quality scores with the actual error rate, it should be fine regardless of the basecaller used.

Best, Kristoffer

nextgenusfs commented 2 years ago

Fantastic thanks -- I'll give it a try and see if I find an improvement.

nextgenusfs commented 2 years ago

Brilliant -- I just needed to also edit at the qualcomp lines below -- ie the hp_compressed error rate function. Hopefully that looks correct to you? https://github.com/nextgenusfs/isONclust/blob/master/modules/cluster.py#L291-L314

ksahlin commented 2 years ago

Ah, don't know how I missed that, I was looking for the quality values too.. Sorry about that.

Yes, your fix looks good from what I can see.