Align pacbio - Githubissues

Bernadetadad commented 4 years ago

Aligning pacbio reads to flu references.

@jbloom when I'm aligning the reads there seems to be some issue in my annotation because [most reads end up not aligning](notebooks/align_pacbio.py.ipynb). The most common reasons are 5' and 3' clip for sequenced_mRNA_1 and sequenced_mRNA_2, respectively. Maybe I don't totally understand how termini5 and termini3 need to be defined. At the moment I have these defined as sequences expected at the ends of length amplicons, but are these perhaps supposed to be termini of what is annotated as sequenced_mRNA_1/2?

jbloom commented 4 years ago

@Bernadetadad, I may be slightly confused about what I am supposed to be looking at as you have two Genbank files in your data directory for the PacBio sequencing, one of which is described in the README and the other of which is used by the notebook. What I'm posting below is based on looking at fluCA09.gb.

But as best I can tell from looking for a bit, several things confuse me:

What exactly are termini3 and termini5 even supposed to represent in your Genbank files? They appear to completely overlap the regions you define as the coding mRNA. If your amplicons start and end exactly on the viral mRNA, then why are you defining separate "termini"? If your amplicons don't begin and end exactly on the viral mRNA, then the termini should be separate from the viral mRNA. In particular, it doesn't make any sense to define for instance termini5 and sequenced_mRNA_1 as having same 5' termini, and then allowing 5' clipping only on the former. If the former is clipped, then of course the latter will be too as they have the same 5' termini?
It seems to me that also a lot of your reads just aren't mapping at all to any of your targets ("unmapped"). Have you looked to see what these are and if they contain influenza sequence?
You should be able to extract the names of specific sequences that fail specific filters and look at them manually to see where things are going wrong.

Bernadetadad commented 4 years ago

So, I thought that termini5 and termini3 are supposed to be the ends of amplicons (meaning sequences at the end of the final PCR product, which are the sites where the primer primers that linearize the circular templates anneal). Which is why my termini sequences are in the coding regions because my primers that linearize the circular template are just at the start of the coding region. Or are these supposed to be just in general: termini5 == 5' of the mRNA and termini3 == read1 site (so the structure of RNA that is formed after the cDNA amplification in 10x protocol). attached is how the amplicon looks after final PCR for PB2 enrichment Read1 and 5' terminus of mRNA in this case would be next to each other (primers are in violet).

Bernadetadad commented 4 years ago

Also, @jbloom is there a way I can get alignparse to make a csv with a list of unmapped read IDs, similar to how it output's filtered read csv? I don't see anywhere in docs about it.

Also, I think the reason why, for the reads that have an identified sequence target, there is a similar number of aligned and filtered reads is that the same read that gets filtered in target made with terminal primers is aligning to target made with middle aligning primers. As shown in attached.

jbloom commented 4 years ago

@Bernadetadad, as far as the first question, termini5 and termini3 have no special meaning to alignparse. They are just user-defined features, just like sequenced_mRNA_1, cellbarcode, and whatever else you define.

I think in all the examples in the alignparse documentation, the templates being analyzed have termini distinct from the gene of interest and other features, which are why termini5 / termini3 features are included.

In your molecule as I understand it you do not have termini distinct from the mRNA regions? In that case, there isn't really a need to define a termini5 / termini3 feature. You just want to define features that let you get out what you want from the overall PacBio amplicon.

Alternatively, if you want those features so for instance you can require exact matches at expected termini of molecule, the maybe include them. But hopefully it makes sense why your alignment specs only make sense if the clipping is set the same for the termini5 and first mRNA chunk (and likewise at 3' end). If the termini5 is clipped at the 5' end, then by definition the first mRNA chunk would be too.

In other words, think about it as taking the expected amplicon and breaking it down into the different elements you want to require (with various levels of allowed mismatches) and/or extract sequences for, and then just define whatever features are necessary to achieve that. This should help you determine if you need a termini5 / termini3 feature at all.

jbloom commented 4 years ago

As far as the notion of reads aligning to the same gene in termini and middle arrangements, and then being filtered in one of them: maybe this is happening but I'm not sure. At least my understanding from the minimap2 docs is that you have an option to output just the primary (best) alignment, or primary and secondary alignments. I think the suggested alignparse defaults should not be outputting secondary alignments (https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.OPTIONS_VIRUS_W_DEL).

So in that case, I think as long as you are aligning to all targets at once the minimap2 should repeat each read only aligning to either terminal or middle primers, and so it shouldn't be accepted in one and filtered in the other?

In addition, the multi_align argument to Targets.align_and_parse only accepts 'primary': https://jbloomlab.github.io/alignparse/alignparse.targets.html#module-alignparse.targets

jbloom commented 4 years ago

As far as getting the unmapped reads, I think that in the output directory of the alignments there should be a SAM (or maybe BAM) file with all the alignments. I think the easiest way would just be to filter that file for unmapped reads: https://www.biostars.org/p/213436/

Bernadetadad commented 4 years ago

I looked at bam alignment via IGV too, the highlighted read m54228_201020_194205/4325655/ccs has a single nucleotide at the 5' end of sequenced_mRNA_1 (which is also the 5' end of the whole amplicon) that is absent in the reference. This read (and I presume all the others you can see in the attached image with the same pattern) get's discarded for query_clip_5 reason even when config file is set to query_clip5: 20 , as is now in the current branch I also tried query_clip5:5and clip5: 5 for sequenced_mRNA_1 still get the same result Screen Shot 2020-11-03 at 11 46 38 AM

Bernadetadad commented 4 years ago

I think the right thing to do for polyA site is to set mutation_nt_count to null; looking at the IGV you see the insertions at the polyA sites of all kinds (those numbers in violet background) Screen Shot 2020-11-03 at 12 10 01 PM

jbloom commented 4 years ago

@Bernadetadad, for read m54228_201020_194205/4325655/ccs, I think the issue is that it is soft clipped by 32 nucleotides at the 5' end by minimap2. That is what the cigar string and IGV view you posted shows. No why it is soft clipped is not yet clear to me, but this seems like a minimap2 alignment issue rather than an alignparse one in that case, right?

jbloom commented 4 years ago

@Bernadetadad, looking at this more, I think aligner might be giving reasonable behavior?

Here is my (manual, not minimap2) alignment of the first 100 nt of the FASTQ read against the first 100 nt of the PB2termini target:

1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAG--TCCC----GCA-CTCGCGAGAT-ACTCACTAA-GA-C-CACTGTGGACCAT---A-TGGC-C-A
  ||||||||||||||||||||||||||||||||||  |       | | ||   ||||  | |  |||| |  | ||  ||   ||     | |  | | |
1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAGAAT---AAAAG-AACT---GAGA-GA-T--CTAATG-TCGCA--GT---CC--CGCACT--CGCGA
  Score=74

It looks good at first, but then all hell breaks out as it doesn't align past the first part.

On the other hand, here is what it looks like if I do what minimap2 is doing: clip off (soft clip) the first 32 nucleotides of the FASTQ query, and then align the next 100 nt to the first 100 nt of the PB2termini target:

1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACA
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 -AGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACA
  Score=99

I think this suggests that in the FASTQ read itself, the first 32-ish nucleotide occurs twice, and that the way to get the best alignment is what minimap2 does and just soft clip off the first 32 nucleotides.

This then obviously fails the alignparse filters since there are 32 nucleotides clipped off the 5' end (I'm guessing it would pass if you upped query_clip5 to 32).

But I'm not sure if we want to do this? I'm not sure why first 32 nucleotides are repeated? Maybe this is some aspect of the PCR that actually makes sense that we just forgot to think of when designing target sequences for alignment? Or maybe it indicates some sort of off-target PCR product that makes it safer just to discard the entire read.

I guess the better behavior depends on:

If you can understand why first 32 nt are repeated
If this sort of thing is common

If it's not common, then probably better to just discard these reads. If it is common, then it's probably good to understand the origin?

Bernadetadad commented 4 years ago

Yes, I did not notice this, it is indeed soft clipped, but doesn't alignparse only sees what happens after clipping, which is a read that is missing one nucleotide at it's 5' end? And if so, adding a query_clip_5 anything larger than 1 should still include this read.

jbloom commented 4 years ago

No, soft clipping is still clipping of the query, so it's not ignored. The idea being that if there is some mysterious sequence at termini that has be soft clipped, it should make us suspicious about read. Do we really want to be counting a read that has some unexplained 32 nucleotide extension?

dbacsik commented 4 years ago

A note about weird 5' sequences: The TSO can be found at the 5' end of some reads, and this is expected at a low rate in 10X libraries. This doesn't seem to be the case here, as this looks like authentic PB2 sequence that is repeated. Just thought I would make a note of this.

Bernadetadad commented 4 years ago

I'm almost certain this must be some sort of PCR artifact as it is the exact 32nt primer binding site that gets duplicated and the same thing happens with the 3' query site as well. My guess is that primers end up somehow acting as a template at the end of the PCR product and that leads to duplicated sequences, but staring at the templates I have not yet managed to figure out exactly what process is leading to these artifacts this. And I think a lesson for me is that next time it would better to linearize with primers that are adjacent rather than overlapping.

As far as the frequency for PB2termini templates this happens in ~4,000 amplicons out of 27,000 (including clipping at both 5 and 3 ends) and in PB2mid in around 1,000 out of 10,000 reads. I think we should still include these reads as good amplicons by allowing more clipping as it is a PCR artifact that should not by itself affect the amplicon sequence.

jbloom commented 4 years ago

@Bernadetadad, is it possible it somehow arises when the circularization actually concatenates together multiple different molecules?

In any case, I haven't looked at actual data in detail, but I think we should be cautious about relaxing criteria to include them. It might be OK. But on the other hand, increasing coverage by 10% is not that significant of an improvement (we can always just sequence more), whereas introducing spurious mutations / molecules is a very significant problem. So I'd suggest one of two things:

Minimally, once all mutation calling is set, do with specs that do / do not exclude these clipped sequences to make sure results are not changing.
I can quickly write a script that we add to the pipeline that specifically identifies reads with exact duplications of the known termini at the end and remove one copy of the termini. This might be a "safer" and more targeted strategy as we aren't then letting through all molecules regardless of composition at first / last 30 nt. If you want me to add such a rule, just let me know. To do it, I just need a file that gives the expected duplicated termini region (I think this is essentially the primers?) at each end.

Bernadetadad commented 4 years ago

I don't see exactly how circularization could lead to this, the circularization itself is a HIFi assembly via Read1 region, that is not next to the primer binding sites. If concatamers are made via PCR off the circularization product (though I don't think there where that many made based on tapestation gel) I don't see how they would have duplicated just the primer sites.

I really think it would be fine to include it because it is most certainly some aspect of PCR that leads to this and these templates don't really change the sequence we are interested in. That said, given that I still cannot explain exactly how these duplicated ends form it is probably fair to exclude them for now.

One more parameter that I'd like to double check is the reads that are discarded due to final_mRNA_nts mutation_op_count, as illustrated in the attached images (one shows zoomed out one zoomed in amplicons), these reads get discarded , but they are actually totally good reads and the on'y issue is that the dT primer has 5'-AVN-3' at it's end and so it can introduce either additional As/ Cs/Gs at the end where I've annotated the end of the mRNA molecule and the aligner ends up counting too many cs mutation tags in there. I could either increase mutation_op_count , currently it is 2, or I'm not totally sure what exact benefit does annotating `final_mRNA_nts' give in this case.

As opposed to the end of the mRNA, I think that probably all reads we are interested should have the 5' end of the viral mRNA otherwise it is probably some TSO mispriming.

jbloom commented 4 years ago

OK, sounds good. A few thoughts / comments:

So do you just want to keep allowing 32 nt clipping at the query ends, or do you want me to write a rule that specifically eliminates reads with repeated primer sequences at end and then only allow minimal clipping? Decision would not affect how we handle the reads with the repeated primer at termini (they'd be accepted either way), but whether we accept other reads that might have some other spurious sequence at their ends (not sure if these exist much, and if so whether they are bad).
As far as the final_mRNA_nts, I'm not sure if this is needed either. Only reason that pops to mind is if we want polyA mis-alignment to go into this feature rather than sequenced_mRNA, but not sure if this is important.
Other stuff you are mentioning sounds good. But just to beat a dead horse, as you're setting this up just keep in mind that it vastly better to throw out 10% of reads that you can actually see by eye are valid because you're being too stringent than to keep even 1% of reads that actually have something wrong with them that lead us to call spurious mutations in regions of viral gene we care about (or mis-call barcode / UMI). The reason is that the former just means we have 10% less data (easily fixed by more sequencing), while the latter leads to actual contamination of our results. I'm not saying you are doing this wrong (not paying enough attention to details / statistics), but just keep this principle in mind: the most common mistake people make in NGS analysis is spending inordinate amounts of time trying to rescue a small percentage of the data.
One other issue to keep in mind as we move forward: at some point we will also have to translate this back into viral coordinates, including to call amino-acid mutations. Not sure best way to do this yet, and we can meet to brainstorm. Would we also annotate coding regions, or just reconstruct the viral mRNA mutations and then have separate code that gets it into coding frame?

Bernadetadad commented 4 years ago

I think it would be nice to have such clipping rule. But it would be not just 32nt clipping but a removal of duplicated sequences at the ends, which could be of different lengths depending on the primer length I've used. Do you need a list of primers for that?
Currently sequenced_mRNA annotates the ORF there's an unannotated gap between the stop codon and polyA site.
Yes I get that.
I'll finish annotations on all segments, run the alignment and see if I still have the very large unmapped percentage. If so I will then sort out that and then we can talk how to convert back sequence coordinates.

jbloom commented 4 years ago

Sounds good. For the clipping rule, can you just post the sequences that are possible termini (either just directly into issue or as some text file) along with some explanation of what the sequences are, and then I can write this rule.

Bernadetadad commented 4 years ago

Do you mean post how the duplicated sequence would look? or just the actual sequence that should be at the termini

jbloom commented 4 years ago

My understanding is that there is some expected sequence (e.g., 32 nt) at the termini of each target. For instance, above we were looking at it for PB2termini. And in some cases that sequence might be duplicated. So we want to find all sequences that have it duplicated, and then remove the duplicated termini so it passes the original filters, right?

If that is the case, then I'd need to know the expected sequence that could be duplicated (I think this is the primer?) at each end of each target.

Bernadetadad commented 4 years ago

Yes, that's correct. Sequences attached. These duplications can be present either at 5 or 3 ends of amplicons. I've noticed also duplications tend to miss the first nucleotide as in the example above GAGAATAAAAGAACTGAGAGATCTAATGTCGC-AGAATAAAAGAACTGAGAGATCTAATGTCGC duplicated_sequences.txt

jbloom commented 4 years ago

@Bernadetadad, and just to clarify: at 3' end we expect reverse complement of these sequences, is that correct?

Bernadetadad commented 4 years ago

No, it's the same sequence at both ends as shown here for 5' and 3' ends

Bernadetadad commented 4 years ago

@jbloom I've made amplicons for all segments and done alignments but I still get a huge number (176,783) of reads through out. Looking at the unmapped reads they are very good flu sequences (as in attached). The only thing that's different about these as opposed to aligned sequences is that they all seem to map to the reverse stand. I'm using OPTIONS_VIRUS_W_DEL parameters for mapping and I see that you have --for-only set which only maps reads to the forward strand of the reference sequences, so does that mean that anything that maps to reverse strad actually gets discarded or that the mapped output read is always given as a forward reference?

jbloom commented 4 years ago

@Bernadetadad, this is a really good question.

I cannot think of any reason why we should only keep the forward reads. It would seem like depending on which strand is being sequenced either could be valid.

I'm not sure if there was a good reason that I've now forgotten, or if this was just a mistake when I chose the values of OPTIONS_VIRUS_W_DEL and make them include --for-only. But I think it's more likely the latter?

Do you want to try running without --for-only? (You can just modify code in your pipeline that passes all the options currently in OPTIONS_VIRUS_W_DEL except that one.) If that indeed then seems to be working better on the real data and doesn't uncover some hidden and forgotten reason for the --for-only option, then can you either make a pull request to alignparse or raise an issue to alignparse fixing the options to remove that?

Bernadetadad commented 4 years ago

Looks like --for-only was indeed the issue, without it only 1,887 reads are unmapped! I've add an issue #72

jbloomlab / barcoded_flu_pdmH1N1

Align pacbio #95