Closed Bernadetadad closed 4 years ago
@Bernadetadad, I may be slightly confused about what I am supposed to be looking at as you have two Genbank files in your data
directory for the PacBio sequencing, one of which is described in the README and the other of which is used by the notebook. What I'm posting below is based on looking at fluCA09.gb
.
But as best I can tell from looking for a bit, several things confuse me:
What exactly are termini3
and termini5
even supposed to represent in your Genbank files? They appear to completely overlap the regions you define as the coding mRNA. If your amplicons start and end exactly on the viral mRNA, then why are you defining separate "termini"? If your amplicons don't begin and end exactly on the viral mRNA, then the termini should be separate from the viral mRNA. In particular, it doesn't make any sense to define for instance termini5
and sequenced_mRNA_1
as having same 5' termini, and then allowing 5' clipping only on the former. If the former is clipped, then of course the latter will be too as they have the same 5' termini?
It seems to me that also a lot of your reads just aren't mapping at all to any of your targets ("unmapped"). Have you looked to see what these are and if they contain influenza sequence?
You should be able to extract the names of specific sequences that fail specific filters and look at them manually to see where things are going wrong.
So, I thought that termini5
and termini3
are supposed to be the ends of amplicons (meaning sequences at the end of the final PCR product, which are the sites where the primer primers that linearize the circular templates anneal). Which is why my termini sequences are in the coding regions because my primers that linearize the circular template are just at the start of the coding region. Or are these supposed to be just in general: termini5
== 5' of the mRNA and termini3
== read1 site (so the structure of RNA that is formed after the cDNA amplification in 10x protocol). attached is how the amplicon looks after final PCR for PB2 enrichment Read1 and 5' terminus of mRNA in this case would be next to each other (primers are in violet).
Also, @jbloom is there a way I can get alignparse to make a csv with a list of unmapped read IDs, similar to how it output's filtered read csv? I don't see anywhere in docs about it.
Also, I think the reason why, for the reads that have an identified sequence target, there is a similar number of aligned and filtered reads is that the same read that gets filtered in target made with terminal primers is aligning to target made with middle aligning primers. As shown in attached.
@Bernadetadad, as far as the first question, termini5
and termini3
have no special meaning to alignparse
. They are just user-defined features, just like sequenced_mRNA_1
, cellbarcode
, and whatever else you define.
I think in all the examples in the alignparse
documentation, the templates being analyzed have termini distinct from the gene of interest and other features, which are why termini5
/ termini3
features are included.
In your molecule as I understand it you do not have termini distinct from the mRNA regions? In that case, there isn't really a need to define a termini5
/ termini3
feature. You just want to define features that let you get out what you want from the overall PacBio amplicon.
Alternatively, if you want those features so for instance you can require exact matches at expected termini of molecule, the maybe include them. But hopefully it makes sense why your alignment specs only make sense if the clipping is set the same for the termini5
and first mRNA chunk (and likewise at 3' end). If the termini5
is clipped at the 5' end, then by definition the first mRNA chunk would be too.
In other words, think about it as taking the expected amplicon and breaking it down into the different elements you want to require (with various levels of allowed mismatches) and/or extract sequences for, and then just define whatever features are necessary to achieve that. This should help you determine if you need a termini5
/ termini3
feature at all.
As far as the notion of reads aligning to the same gene in termini and middle arrangements, and then being filtered in one of them: maybe this is happening but I'm not sure. At least my understanding from the minimap2
docs is that you have an option to output just the primary (best) alignment, or primary and secondary alignments. I think the suggested alignparse
defaults should not be outputting secondary alignments (https://jbloomlab.github.io/alignparse/alignparse.minimap2.html#alignparse.minimap2.OPTIONS_VIRUS_W_DEL).
So in that case, I think as long as you are aligning to all targets at once the minimap2
should repeat each read only aligning to either terminal or middle primers, and so it shouldn't be accepted in one and filtered in the other?
In addition, the multi_align
argument to Targets.align_and_parse
only accepts 'primary': https://jbloomlab.github.io/alignparse/alignparse.targets.html#module-alignparse.targets
As far as getting the unmapped reads, I think that in the output directory of the alignments there should be a SAM (or maybe BAM) file with all the alignments. I think the easiest way would just be to filter that file for unmapped reads: https://www.biostars.org/p/213436/
I looked at bam alignment via IGV too, the highlighted read m54228_201020_194205/4325655/ccs
has a single nucleotide at the 5' end of sequenced_mRNA_1
(which is also the 5' end of the whole amplicon) that is absent in the reference. This read (and I presume all the others you can see in the attached image with the same pattern) get's discarded for query_clip_5
reason even when config file is set to query_clip5: 20
, as is now in the current branch I also tried query_clip5:5
and clip5: 5
for sequenced_mRNA_1
still get the same result
I think the right thing to do for polyA site is to set mutation_nt_count
to null
; looking at the IGV you see the insertions at the polyA sites of all kinds (those numbers in violet background)
@Bernadetadad, for read m54228_201020_194205/4325655/ccs
, I think the issue is that it is soft clipped by 32 nucleotides at the 5' end by minimap2
. That is what the cigar string and IGV view you posted shows. No why it is soft clipped is not yet clear to me, but this seems like a minimap2
alignment issue rather than an alignparse
one in that case, right?
@Bernadetadad, looking at this more, I think aligner might be giving reasonable behavior?
Here is my (manual, not minimap2
) alignment of the first 100 nt of the FASTQ read against the first 100 nt of the PB2termini target:
1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAG--TCCC----GCA-CTCGCGAGAT-ACTCACTAA-GA-C-CACTGTGGACCAT---A-TGGC-C-A
|||||||||||||||||||||||||||||||||| | | | || |||| | | |||| | | || || || | | | | |
1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAGAAT---AAAAG-AACT---GAGA-GA-T--CTAATG-TCGCA--GT---CC--CGCACT--CGCGA
Score=74
It looks good at first, but then all hell breaks out as it doesn't align past the first part.
On the other hand, here is what it looks like if I do what minimap2
is doing: clip off (soft clip) the first 32 nucleotides of the FASTQ query, and then align the next 100 nt to the first 100 nt of the PB2termini target:
1 GAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACA
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 -AGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACA
Score=99
I think this suggests that in the FASTQ read itself, the first 32-ish nucleotide occurs twice, and that the way to get the best alignment is what minimap2
does and just soft clip off the first 32 nucleotides.
This then obviously fails the alignparse
filters since there are 32 nucleotides clipped off the 5' end (I'm guessing it would pass if you upped query_clip5
to 32).
But I'm not sure if we want to do this? I'm not sure why first 32 nucleotides are repeated? Maybe this is some aspect of the PCR that actually makes sense that we just forgot to think of when designing target sequences for alignment? Or maybe it indicates some sort of off-target PCR product that makes it safer just to discard the entire read.
I guess the better behavior depends on:
If it's not common, then probably better to just discard these reads. If it is common, then it's probably good to understand the origin?
Yes, I did not notice this, it is indeed soft clipped, but doesn't alignparse
only sees what happens after clipping, which is a read that is missing one nucleotide at it's 5' end? And if so, adding a query_clip_5
anything larger than 1 should still include this read.
No, soft clipping is still clipping of the query, so it's not ignored. The idea being that if there is some mysterious sequence at termini that has be soft clipped, it should make us suspicious about read. Do we really want to be counting a read that has some unexplained 32 nucleotide extension?
A note about weird 5' sequences: The TSO can be found at the 5' end of some reads, and this is expected at a low rate in 10X libraries. This doesn't seem to be the case here, as this looks like authentic PB2 sequence that is repeated. Just thought I would make a note of this.
I'm almost certain this must be some sort of PCR artifact as it is the exact 32nt primer binding site that gets duplicated and the same thing happens with the 3' query site as well. My guess is that primers end up somehow acting as a template at the end of the PCR product and that leads to duplicated sequences, but staring at the templates I have not yet managed to figure out exactly what process is leading to these artifacts this. And I think a lesson for me is that next time it would better to linearize with primers that are adjacent rather than overlapping.
As far as the frequency for PB2termini templates this happens in ~4,000 amplicons out of 27,000 (including clipping at both 5 and 3 ends) and in PB2mid in around 1,000 out of 10,000 reads. I think we should still include these reads as good amplicons by allowing more clipping as it is a PCR artifact that should not by itself affect the amplicon sequence.
@Bernadetadad, is it possible it somehow arises when the circularization actually concatenates together multiple different molecules?
In any case, I haven't looked at actual data in detail, but I think we should be cautious about relaxing criteria to include them. It might be OK. But on the other hand, increasing coverage by 10% is not that significant of an improvement (we can always just sequence more), whereas introducing spurious mutations / molecules is a very significant problem. So I'd suggest one of two things:
I don't see exactly how circularization could lead to this, the circularization itself is a HIFi assembly via Read1 region, that is not next to the primer binding sites. If concatamers are made via PCR off the circularization product (though I don't think there where that many made based on tapestation gel) I don't see how they would have duplicated just the primer sites.
One more parameter that I'd like to double check is the reads that are discarded due to final_mRNA_nts mutation_op_count
, as illustrated in the attached images (one shows zoomed out one zoomed in amplicons), these reads get discarded , but they are actually totally good reads and the on'y issue is that the dT primer has 5'-AVN-3' at it's end and so it can introduce either additional As/ Cs/Gs at the end where I've annotated the end of the mRNA molecule and the aligner ends up counting too many cs mutation tags in there. I could either increase mutation_op_count
, currently it is 2, or I'm not totally sure what exact benefit does annotating `final_mRNA_nts' give in this case.
As opposed to the end of the mRNA, I think that probably all reads we are interested should have the 5' end of the viral mRNA otherwise it is probably some TSO mispriming.
OK, sounds good. A few thoughts / comments:
So do you just want to keep allowing 32 nt clipping at the query ends, or do you want me to write a rule that specifically eliminates reads with repeated primer sequences at end and then only allow minimal clipping? Decision would not affect how we handle the reads with the repeated primer at termini (they'd be accepted either way), but whether we accept other reads that might have some other spurious sequence at their ends (not sure if these exist much, and if so whether they are bad).
As far as the final_mRNA_nts
, I'm not sure if this is needed either. Only reason that pops to mind is if we want polyA mis-alignment to go into this feature rather than sequenced_mRNA
, but not sure if this is important.
Other stuff you are mentioning sounds good. But just to beat a dead horse, as you're setting this up just keep in mind that it vastly better to throw out 10% of reads that you can actually see by eye are valid because you're being too stringent than to keep even 1% of reads that actually have something wrong with them that lead us to call spurious mutations in regions of viral gene we care about (or mis-call barcode / UMI). The reason is that the former just means we have 10% less data (easily fixed by more sequencing), while the latter leads to actual contamination of our results. I'm not saying you are doing this wrong (not paying enough attention to details / statistics), but just keep this principle in mind: the most common mistake people make in NGS analysis is spending inordinate amounts of time trying to rescue a small percentage of the data.
One other issue to keep in mind as we move forward: at some point we will also have to translate this back into viral coordinates, including to call amino-acid mutations. Not sure best way to do this yet, and we can meet to brainstorm. Would we also annotate coding regions, or just reconstruct the viral mRNA mutations and then have separate code that gets it into coding frame?
sequenced_mRNA
annotates the ORF there's an unannotated gap between the stop codon and polyA site.Sounds good. For the clipping rule, can you just post the sequences that are possible termini (either just directly into issue or as some text file) along with some explanation of what the sequences are, and then I can write this rule.
Do you mean post how the duplicated sequence would look? or just the actual sequence that should be at the termini
My understanding is that there is some expected sequence (e.g., 32 nt) at the termini of each target. For instance, above we were looking at it for PB2termini
. And in some cases that sequence might be duplicated. So we want to find all sequences that have it duplicated, and then remove the duplicated termini so it passes the original filters, right?
If that is the case, then I'd need to know the expected sequence that could be duplicated (I think this is the primer?) at each end of each target.
Yes, that's correct. Sequences attached.
These duplications can be present either at 5 or 3 ends of amplicons. I've noticed also duplications tend to miss the first nucleotide as in the example above GAGAATAAAAGAACTGAGAGATCTAATGTCGC-AGAATAAAAGAACTGAGAGATCTAATGTCGC
duplicated_sequences.txt
@Bernadetadad, and just to clarify: at 3' end we expect reverse complement of these sequences, is that correct?
No, it's the same sequence at both ends as shown here for 5' and 3' ends
@jbloom I've made amplicons for all segments and done alignments but I still get a huge number (176,783) of reads through out. Looking at the unmapped reads they are very good flu sequences (as in attached). The only thing that's different about these as opposed to aligned sequences is that they all seem to map to the reverse stand. I'm using OPTIONS_VIRUS_W_DEL
parameters for mapping and I see that you have --for-only
set which only maps reads to the forward strand of the reference sequences, so does that mean that anything that maps to reverse strad actually gets discarded or that the mapped output read is always given as a forward reference?
@Bernadetadad, this is a really good question.
I cannot think of any reason why we should only keep the forward reads. It would seem like depending on which strand is being sequenced either could be valid.
I'm not sure if there was a good reason that I've now forgotten, or if this was just a mistake when I chose the values of OPTIONS_VIRUS_W_DEL
and make them include --for-only
. But I think it's more likely the latter?
Do you want to try running without --for-only
? (You can just modify code in your pipeline that passes all the options currently in OPTIONS_VIRUS_W_DEL
except that one.) If that indeed then seems to be working better on the real data and doesn't uncover some hidden and forgotten reason for the --for-only
option, then can you either make a pull request to alignparse
or raise an issue to alignparse
fixing the options to remove that?
Looks like --for-only
was indeed the issue, without it only 1,887 reads are unmapped! I've add an issue #72
Aligning pacbio reads to flu references.
@jbloom when I'm aligning the reads there seems to be some issue in my annotation because
[most reads end up not aligning](notebooks/align_pacbio.py.ipynb)
. The most common reasons are 5' and 3' clip forsequenced_mRNA_1
andsequenced_mRNA_2
, respectively. Maybe I don't totally understand howtermini5
andtermini3
need to be defined. At the moment I have these defined as sequences expected at the ends of length amplicons, but are these perhaps supposed to be termini of what is annotated assequenced_mRNA_1/2
?