Multiple entropy peaks with PacBio CCS reads

pneumowidow commented 5 years ago

Hi,

I recently came across oligotyping and decided to try it for a bacterial mock community I sequenced via Pacbio Sequel platform. I have to state that my amplicons are NOT 16S, but a non-coding region of ~1.3kb. This region is highly polymorphic for our bacterial strains and can differ by 3 to 20 SNPS. Obviously, given the length of the region and the fact that we want to resolve communities in high depth mixtures (1:99), Pacbio (instead of Illumina) was the best choice.

In brief, the raw subreads were aligned to a reference and then CCS reads with accuracy of 99% were generated. These were then aligned again to the same reference and were converted from bam to fastq. DADA2 was initially used to try to resolve our mock communities, but we saw that it was limited in strain mixtures higher than 1:49 ratio. Basically, we couldn't resolve the least abundant communities in 1:60..1:99 mixtures.

Since DADA2 already trimmed off primers and quality filtered based on length and QV scores, I decided to take that output and use it as an input in your pipeline. Mostly because I saw that you were able to resolve oligotypes with as little as 2 SNPs. Since my DADA2 output was a fastq, I converted it to fasta and followed your suggestion of multiple sequence alignment (MSA) with PyNAST and afterwards used the o-trim-uninformative-columns-from-alignment script to remove uninformative gaps. I assessed my reads before and after and they looked fine. I still had some indels from the sequencing (as shown below) but otherwise, it was ok.

I then ran the entropy-analysis command to view entropy rich regions and in my snapshot I got the following below:

Firstly, there are too many positions than expected and secondly, the highest entropy value of 1.7 doesn't make sense because this community in question contains 2 bacterial strains in a 1:9 ratio which differ by 17 SNPs. I guess in theory, I expected to see 17 high entropy regions, but I have TOO many. This is a mock community that I was able to get accurate ASVs with DADA2, so I'm completely surprised.

I guess my question is, can I improve my output in any way? Or is this just a SAMPLE ISSUE because my sequences were generated from PACBIO and regardless of them being very accurate CCS reads, the presence of some indels may be affecting the ability of the entropy-analysis program to generate a clean, entropy distributed output that directly reflects the heterogeneity in the sample?

I would like to hear your thoughts on this. Thank you

meren commented 5 years ago

I can see from your screenshot that your sequences are very poorly aligned, and it a large number high entropy positions are rather expected. I don't think PyNAST is a good choice for aligning these reads, again, looking at the screenshot you sent, and perhaps you should give muscle a go.

Best,

pneumowidow commented 5 years ago

Hi Meren,

Thanks for the quick response. I actually started with muscle given the length of my reads and ran the entropy-analysis command, but I got the same result. That's why I decided to follow your recommended steps as detailed in your tutorial. Yes, with Muscle I have a far better alignment but I get the same entropy output regardless. So does this mean that if I'm looking at higher number of variable positions, then oligotyping would not be beneficial?

Thank you

meren commented 5 years ago

Yes, with Muscle I have a far better alignment but I get the same entropy output regardless.

I am not sure how it is possible to get the same entropy output with a far better alignment. The alignment in the screenshot is simply not a proper alignment and contains multiple reads with 1- or 2-base shifts to either directions.

So does this mean that if I'm looking at higher number of variable positions, then oligotyping would not be beneficial?

Your data seem to have lots of homopolymer-region associated in/dels. Oligotyping could be beneficial to find those among high number or variable positions that carry out more biologically relevant signal.

pneumowidow commented 5 years ago

Dear Meren,

I am not sure how it is possible to get the same entropy output with a far better alignment. The alignment in the screenshot is simply not a proper alignment and contains multiple reads with 1- or 2-base shifts to either directions.

By the same entropy output, I mean that my highest entropy was still 1.0 regardless of the alignment software I used and this entropy value wasn't discernibly higher than the other values that came afterwards. Also, the positions at entropy of 1.0 (for both muscle and PyNast) are not the expected positions for the SNPs (which are known to be present there). Below is a print out of the entropy positions. Sorry, I don't have a graphical print out because the Linux server I was using, didn't support one.

Your data seem to have lots of homopolymer-region associated in/dels. Oligotyping could be beneficial to find those among high number or variable positions that carry out more biologically relevant signal.

Yeah, I had a feeling the in/dels would be my main Achilles heel with your software. Well, thanks anyway and I really appreciate the quick responses.

merenlab / oligotyping

Multiple entropy peaks with PacBio CCS reads #29