psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Indels near the beginning of sequence (FR1-FR2) not being removed #225

Closed Irrationone closed 6 years ago

Irrationone commented 7 years ago

Hi Duncan,

I ran partis and extracted large clonal families with the full algorithm (after obtaining seeds with the fast vsearch method). The sequences don't appear to be properly aligned up to the CDR2 region (but are fine after that). Attached is a heatmap of the sequences (ACGTN = 1..5).

https://drive.google.com/open?id=0B5MDPqi4noNfQ1BiV2VKRko0S2c

This is a consistent pattern among most (>90%) of the clonal families I'm looking at (large ones, with >250 sequences).

Thanks!

psathyrella commented 7 years ago

hm, which sequences are these that aren't aligned? Like, 'input_seqs', or 'indel_reversed_seqs', in the output csv?

So the way we handle shm indels is that the initial smith waterman step "reverses" them before passing things to the hmm, and then the hmm output reports both the initial input and the indel-reversed sequences. I did this latter part poorly; well, it wasn't as clear as it should have been which was input and which was indel-reversed. That should have been mostly clarified when the 'seqs' column was split into the two columns (in the last month), but I'm still fixing a few things in that regard.
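Conceptually, "reversing" an indel just means undoing it so the sequence lines up with germline again. A minimal sketch of the idea (function names are made up for illustration; this is not the actual partis code):

```python
# Minimal sketch of "reversing" shm indels before the hmm step.
# Illustrative only: function names and signatures are made up,
# not the actual partis implementation.

def reverse_insertion(seq, pos, length):
    """Remove <length> inserted bases starting at 0-based <pos>."""
    return seq[:pos] + seq[pos + length:]

def reverse_deletion(seq, pos, deleted_bases):
    """Re-insert the germline bases that were deleted at <pos>."""
    return seq[:pos] + deleted_bases + seq[pos:]

# e.g. undo a 2-base insertion at position 3:
reversed_seq = reverse_insertion('ACTGGTACA', 3, 2)  # -> 'ACTTACA'
```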

Or, uh, maybe you mean the problem's something else? Can you attach the output csv?

Irrationone commented 7 years ago

These results were from an old version of partis (probably ~1.5 months ago), when there was only the seqs column. They were all the same length, so I assumed they were already indel-reversed.

Here's the output csv:

https://drive.google.com/open?id=0B5MDPqi4noNfNHZpX2VYNUxFOHc

psathyrella commented 7 years ago

No, you're right, the old 'seqs' column was the 'indel_reversed' seqs.

I can't seem to extract consistent annotations from that file. It looks like it's been modified (it's a tsv, not a csv, and there are additional columns); perhaps that could be related to the problem?

for instance, on HS22_80_H27KGBCXX_1_1101_10415_15349__ith2_4_ROv2, the 'seqs' column contains the fv insertion, but doesn't seem to have the two-base jf insertion?

Irrationone commented 7 years ago

Yeah -- I ran partis on multi-sample data (from the same patient), so I did vsearch clustering of a merged subset, seeded full clustering of each sample (using the naive sequence as the seed) to get clusters, and merged the final clusters that contained the naive sequence seed. That's where the tsv was from.

The jf_insertion column is actually missing values, i.e. NA's (from working in R) -- I realize that's really confusing. So it should be blank.

I've uploaded the original output csv for one of the samples (the one containing that sequence):

https://drive.google.com/open?id=0B5MDPqi4noNfRnQ5d2M2b2RTX28

Edit: I should add -- the weird insertion patterns in the heatmap weren't clustered by sample. See: https://drive.google.com/open?id=0B5MDPqi4noNfZEtYbEpxanFZTUU

psathyrella commented 7 years ago

oh, yeah, that is definitely the cause of the crash I had above -- jf_insertion is the actual inserted string, so it will think that's an N followed by an A, which will be inconsistent with the rest of the event.
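If the output is being round-tripped through R or pandas, one way to avoid this is to scrub literal 'NA' placeholders back to empty strings before handing the file back to partis. A small sketch (the helper is made up; the column names are from the partis output csv):

```python
# Sketch: scrub literal 'NA' placeholders (left over from R) out of the
# insertion columns, so a blank insertion isn't parsed as the bases N and A.
# The helper name is made up; column names are from the partis output.

def scrub_na(row, cols=('fv_insertion', 'jf_insertion')):
    """Replace the string 'NA' with an empty string in the given columns."""
    for col in cols:
        if row.get(col) == 'NA':
            row[col] = ''
    return row

row = scrub_na({'jf_insertion': 'NA', 'fv_insertion': 'AC'})
```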

Even if I fix that, though, I still can't read the merged tsv, it looks like you changed indel info formats?

The original csv looks ok, though. There's certainly a lot of shm indels, so if there's a parsing problem with the indelfo format that could mess things up.

Irrationone commented 7 years ago

Yeah -- I changed the indel formats in the tsv. I currently don't do anything with the indels.

Also, these results are from running with the log-likelihood ratio = 30 (instead of the default 18), to try and obtain tighter clusters.

psathyrella commented 7 years ago

oh, wait, do those big red areas just correspond to the N padding?

(attached image)

Irrationone commented 7 years ago

yep!

Edit: it appears that in a lot of cases N padding is done instead of calling an indel

Irrationone commented 7 years ago

Just wanted to clarify -- this is not expected behaviour, right? Is there anything else I can provide to help out?

psathyrella commented 7 years ago

oh, sorry, I didn't see the edit. N padding is definitely expected behavior.

As far as N padding vs indels, there certainly could be cases where it gets confused and makes the wrong call, but I haven't seen it. Which sequences were you thinking were examples of this?

Irrationone commented 7 years ago

The issue is most evident from the heatmaps -- the first portion of the sequences (up to CDR2, i.e. after the red N-padding regions) are slightly misaligned. For example, the first chunk of sequences at the top of the heatmap looks to be 2 nucleotides off from the chunk below it. It's almost as if too much/not enough N-padding was done for some of the sequences, instead of calling an indel near the end of FR2.

I put a black rectangle around an example of this here:

https://drive.google.com/open?id=0B5MDPqi4noNfdV9XRkh4TFlTRmM

psathyrella commented 7 years ago

oh, I see, I completely misunderstood what the problem was before. I'll dig in to try and identify which sequences correspond to the box to see if anything looks fishy.

psathyrella commented 7 years ago

ok I've been poking around a little more in the files you sent, and I still can't see anything obviously wrong, but that may be because I don't know exactly which sequences are a problem where. Could you list, for instance, the sequence immediately above and below the break highlighted by your black box in that ^ pdf?

I also had an idea of what could be going wrong. It's kind of hard to explain, and I can't verify from here without your entire input data and workflow, but I suspect your second clustering step takes as input the N-padded sequences from the first step?

If so, and your workflow is easy to run, could you try running the second step, but giving it your actual original input sequences that you passed to the first step? I could imagine two N-padding steps interfering with each other in the presence of shm indels. If that fixes it, I'll look into removing the N padding from the 'input_seqs' column in the csv output; given that it's now called that, and exists separately from 'indel_reversed_seqs', that probably makes sense, but may be somewhat non-trivial.
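For what it's worth, stripping the padding before re-feeding is nearly a one-liner. A minimal sketch (not a partis function; note that it would also strip any genuine leading/trailing Ns in a read):

```python
def strip_n_padding(seq):
    """Remove leading/trailing N padding before feeding partis output back
    in as input to a second run. Sketch only: this also removes any
    genuine leading/trailing ambiguous bases in the read."""
    return seq.strip('N')

unpadded = strip_n_padding('NNNNNNNNNCAGGTGCAGCTG')  # -> 'CAGGTGCAGCTG'
```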

Irrationone commented 7 years ago

In the top cluster:
HS22_80_H27KGBCXX_2_1113_20933_89362__ith2_4_ROv2
HS22_80_H27KGBCXX_2_2112_5589_10960__ith2_4_ROv2

In the bottom cluster:
HS22_80_H27KGBCXX_2_1204_6674_39099__ith2_4_ROv2

If the sequences are more convenient to work with, I've uploaded those:

https://drive.google.com/open?id=0B5MDPqi4noNfaG5rWWdVaWlTS1E

Yep, the second step uses the seed sequence from the first step, and the seed can have N padding. The one for this cluster is:

NNNNNNNNNCAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGACCTATACGATGACTACAGTTGAGAATGCTATTGATATGTGGGGCCAAGGGACAATGGTCTCCGTCTCCTCAG

I use the naive sequence from the first (vsearch) step as the seed. It looks like this includes fv_insertion. Do you think deleting the fv_insertion is the right thing to do? Will there be issues w/ the seeded alignment step due to the change in length of the seed?

Alternatively I could use one of the input sequences; but I thought it was better to use the naive sequence from the vsearch clusters.

psathyrella commented 7 years ago

one comment re: further back: if you're adjusting the log prob ratio threshold, you may find it useful (or may need) to also adjust the naive hamming threshold. Without rewriting all the details of how the clustering works: we first try to use naive hamming distance; if that's really small we merge without checking likelihood, and if it's really big we don't bother to calculate the likelihood. So if you're using a more stringent likelihood ratio threshold, it may be important to also adjust the naive hamming bounds (--naive-hamming-bounds lo:hi).
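To make the shortcut concrete, the decision logic is roughly like this (illustrative pseudologic, not partis internals; the threshold values below are placeholders standing in for whatever you pass via --naive-hamming-bounds):

```python
def naive_hamming_fraction(a, b):
    """Fraction of differing positions between two equal-length naive seqs."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / float(len(a))

def clustering_decision(frac, lo=0.015, hi=0.08):
    """Two-threshold shortcut: merge below <lo> without computing the
    likelihood, give up above <hi>, and only do the expensive likelihood
    calculation in between. <lo> and <hi> here are placeholder values,
    corresponding to --naive-hamming-bounds lo:hi."""
    if frac < lo:
        return 'merge'
    elif frac > hi:
        return 'no-merge'
    return 'compute-likelihood'
```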

The other, perhaps better (for technical reasons having to do with parallelization), way to get more stringent clusters is to go about it globally: increase --n-partitions-to-write significantly, and take a partition that's a ways before the most likely partition.

That said, the way you're going about it may be fine, but if you're generally worried about uncertainties trying a few different ways would give you some idea.


Now, I'm still sorting out what's even happening and what's wrong with the problem you're seeing, but if it helps in the meantime, it seems like either the N-padding or shm-indel detection is getting confused by something about the input to the second step being output from the first step, rather than "regular" input. It seems likely that this difference is the N-padding: normally it adds N-padding after it's finished deciding everything else about the sequence, i.e. the Ns represent, well, padding. But Ns in an input sequence are viewed as ambiguous bases, which means it could decide that an shm indel goes in the middle of a bunch of Ns, since it doesn't know that the Ns are supposed to be viewed as padding.

In any case, this suggests removing the damn N padding from all output sequences before feeding them back in. And I will probably just stop putting the N padding on the output sequences altogether. But first I'm figuring out how to actually replicate your problem so I know if I've fixed it, 'cause it could be something else altogether.

psathyrella commented 7 years ago

huh, actually, maybe it's nothing nearly so fancy. If I run just on those three sequences you mention with default settings, it gives the annotation we've been talking about, with suspiciously high mutation in the first 25 or so V bases:

(attached image)

But if I just crank down the --gap-open-penalty from the default 30 to 10 (EDITED), it finds indels there:

(attached image)

which results in a final annotation which, indeed, has more homogeneous mutation over the whole V (attached image).

So is the second annotation correct? eh.... probably?

But the default gap open penalty is a tradeoff between, in a typical data set, not missing too many real indels near the ends of sequences, and not introducing too many spurious indels. It's certainly possible that the default value could stand to be re-optimized -- I have little recollection of the details of the original optimization process other than that it happened -- but at least in the short term I'm inclined to say your data set is pretty atypical: it looks like very high shm indel rates, and lots of (all?) the reads don't extend all the way through V, which combine to give lots of sequences with shm indels near the end of the read.
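To see the tradeoff numerically: under a typical affine gap scheme, whether the aligner reports an indel or a run of mismatches comes down to which explanation costs less. Toy numbers below (the mismatch and gap-extension penalties are made up for illustration; only the two gap-open values match this thread):

```python
def mismatch_cost(n_mismatches, mismatch_penalty=3):
    """Cost of explaining a region as point mutations (penalty is made up)."""
    return mismatch_penalty * n_mismatches

def gap_cost(gap_len, gap_open, gap_extend=1):
    """Affine gap cost: one open penalty plus extensions (extend is made up)."""
    return gap_open + gap_extend * (gap_len - 1)

# Whether the aligner calls a 2-base indel instead of 8 mismatches flips
# between gap-open 30 (the default) and 10:
calls_indel_at_10 = gap_cost(2, gap_open=10) < mismatch_cost(8)  # True
calls_indel_at_30 = gap_cost(2, gap_open=30) < mismatch_cost(8)  # False
```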

Irrationone commented 7 years ago

I'll try running without the N padding.

Regarding --gap-open-penalty -- this seems to already be at a default of 30 in the version of partis I pulled this morning, not the 50 you mentioned. Is it just higher in the dev version?

psathyrella commented 7 years ago

Oh, I would definitely try changing the gap-open penalty first. The default is indeed 30, and I had switched it to 10 to make those pngs ^.

Sorry, I have no idea why my brain added twenty to both numbers in the intervening seven seconds.

(editing in the above post for the sake of posterity)

Irrationone commented 7 years ago

Running with gap-open-penalty of 10 now. I'm getting a lot of errors like the following when I run cache-parameters:

[ig_align] Error: match length mismatch for query HS22_80_H27KGBCXX_1_1207_9525_77004__ith1_2_ROv1 score 505 target IGHV4/OR15-8*01: qe - qb = 449 - 4 = 445 != match_length - 1 = 444
        HS22_80_H27KGBCXX_1_1207_9525_77004__ith1_2_ROv1        256     IGHV4/OR15-8*01 1       40      4S10M2I13M5I6M3D12M4I10M10I13M2I9M6I11M12I14M5I5M49I32M21I44M3D30M1I4M4I14M1D10M7I14M1I6M2I9M7I12M1D7M19I2M1I    CTGGCAGTGGTATCAACGCAGAGTACGGGCTTTTCTCAGGGGAGAAGCCATCACTTGAAGATGCTGAGTCTTCTGCTCCTTCTCCTGGGACTAGGCTCTGTGCTCAGTGCTGTCATCTCTCAAAAGCCAAGCAGGGATATCTGTCAACGTGGAACCTCCCTGACGATCCAGTGTCGAGTCGATAGCCAAGTCACCATGATGTTCTGGTACCGTCAGCAACCTAGACAGAGCCTGACACTGATCGCAACTGCAAGTGAGGGCTCTGAGGACACATATGAGAGTGGATTTGTCATTGACAAGTTCCCCATCAGCCGCCCAAACCTAACATTCTCACCTCTGACTGTGAGCACCATGCGCACTGAAGACAGCAGCTTATCTCTCTGCAGCGCCTGGGGGACCCGCTACAATGAGCTGTTCTTCGGGCCAGGCACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTC

Also, the number of sequences and procs used at each stage of SW goes down much slower, i.e.

485721       16
476198       15

when I run with penalty of 10, vs:

485721       16
323775       10

when I run with a penalty of 30.

psathyrella commented 7 years ago

Well, that is a somewhat longstanding thing: ig-sw sometimes kicks out cigar and query lengths that are inconsistent. It's on the list of things to fix, but it's non-trivial, and under normal circumstances "sometimes" means "basically never". So, since the cigar string for that sequence is crazy impossible (it's, like, entirely shm indels):

4S10M2I13M5I6M3D12M4I10M10I13M2I9M6I11M12I14M5I5M49I32M21I44M3D30M1I4M4I14M1D10M7I14M1I6M2I9M7I12M1D7M19I2M1I

I'd say try a penalty of 20 or some other intermediate number? Sorry, even if I could work on the ig-sw thing right now it would not be fast to fix (I've tried already, and need to spend much more time understanding that code base).
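For reference, the consistency check behind that ig_align error message is easy to reproduce outside of partis. A sketch (not the actual ig_align code, and its exact off-by-one convention may differ):

```python
import re

def query_consumed(cigar):
    """Number of query bases consumed by a cigar string: M/I/S/=/X ops
    consume the query, D/N don't."""
    return sum(int(n) for n, op in re.findall(r'(\d+)([MIDNSHP=X])', cigar)
               if op in 'MIS=X')

def cigar_consistent(cigar, qstart, qend):
    """The same kind of sanity check as ig-sw's error message: the query
    span implied by the cigar should match the reported query interval.
    (Sketch only; the actual code's off-by-one convention may differ.)"""
    return query_consumed(cigar) == qend - qstart

span = query_consumed('4S10M2I13M5I6M3D12M')  # deletions don't count: 52
```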

The sequence/proc thing is just a consequence of more sequences having weird annotations: each cycle through s-w re-runs sequences with suspicious annotations (for a number of suspicion criteria), and with the gap-open penalty too low, a ton of them have loads of indels.

psathyrella commented 7 years ago

I don't know anything about this data set, but I also wonder if it's possible that many of the things that look like shm indels are just our brains pattern-matching, trying to line up as many of the vertical streaks as they can, when really the read quality just degrades toward the 5' end of the read? I mean, with enough sequences and enough mutations, some mutation patterns will look plausible as an shm indel, especially toward the end of the sequence.

Said another way, if we did have a data set that had higher error rates near the ends of the reads, some fraction of that dataset would definitely appear to have plausible shm indels in that higher-error region. Enough to explain this? I dunno.

psathyrella commented 7 years ago

Or, said yet another way: a lot of the sequences in that heat map look like they'd line up really well if I could skootch 'em around a bit, but a lot of 'em also just look to me like they have a lot of mutation/error on the 5' ends.

Irrationone commented 7 years ago

I ran into this error with one of my samples when caching parameters; not sure if this is the same k-bounds exception as before:

caching parameters
smith-waterman   (removing less-likely alleles)
        sequences    n_procs     ig-sw time    processing time
          938005       12         6141.2        3751.4
          718052        9         6584.6        3291.4
          700178        8Traceback (most recent call last):
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/bin/partis", line 561, in <module>
    args.func(args)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/bin/partis", line 183, in run_partitiondriver
    parter.cache_parameters()
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 246, in cache_parameters
    self.run_waterer(remove_less_likely_alleles=True, count_parameters=True)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 169, in run_waterer
    waterer.run(cachefname if write_cachefile else None)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 91, in run
    self.read_output(base_outfname, n_procs)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 380, in read_output
    self.summarize_query(qinfo, queries_to_rerun)  # returns before adding to <self.info> if it thinks we should rerun the query
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 917, in summarize_query
    kbounds = self.get_kbounds(infoline, qinfo, best)  # gets the boundaries of the non-best matches from <qinfo>
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 988, in get_kbounds
    elif qrseq[k_v_min + icheck - j_start] == glseq[k_v_min + icheck - j_start]:
IndexError: string index out of range
         5678.3
psathyrella commented 7 years ago

Not the same one, although related in the sense that it's just sequences finding new and interesting ways to fail that I failed to protect against.

Thanks for finding it. This will be difficult to reproduce, so I'll just add a protection that prints an error and skips the sequence; hopefully next time it comes up I can reproduce it.
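The kind of protection I mean is roughly this (a sketch, not the actual commit):

```python
def safe_base_match(qrseq, glseq, idx):
    """Guarded version of the comparison that crashed: instead of raising
    IndexError when <idx> runs off either sequence, print a warning and
    return None so the caller can skip the sequence. A sketch of the kind
    of protection described above, not the actual partis fix."""
    if idx < 0 or idx >= len(qrseq) or idx >= len(glseq):
        print('    warning: index %d out of range (query len %d, germline len %d), skipping'
              % (idx, len(qrseq), len(glseq)))
        return None
    return qrseq[idx] == glseq[idx]
```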

psathyrella commented 7 years ago

ok, I changed it so it prints an error and skips the sequence, rather than crashing.

Irrationone commented 7 years ago

Unfortunately, the error still seems to be there. I'm not getting the warning message either. The string index is lower than before, but it's the same error. git log shows that I have your commit (commit 6be42a437119d3a60d34ef6945b054ac75d809b1); and I re-built partis before I started.

caching parameters
smith-waterman   (removing less-likely alleles)
        sequences    n_procs     ig-sw time    processing time
          938005       12         6093.5        3441.6
          718052        9         6527.3        3161.9
          700178        8Traceback (most recent call last):
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/bin/partis", line 561, in <module>
    args.func(args)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/bin/partis", line 183, in run_partitiondriver
    parter.cache_parameters()
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 246, in cache_parameters
    self.run_waterer(remove_less_likely_alleles=True, count_parameters=True)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/partitiondriver.py", line 169, in run_waterer
    waterer.run(cachefname if write_cachefile else None)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 91, in run
    self.read_output(base_outfname, n_procs)
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 380, in read_output
    self.summarize_query(qinfo, queries_to_rerun)  # returns before adding to <self.info> if it thinks we should rerun the query
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 917, in summarize_query
    kbounds = self.get_kbounds(infoline, qinfo, best)  # gets the boundaries of the non-best matches from <qinfo>
  File "/extscratch/shahlab/shahlab/azhang/projects/ITH_Immune/software/partis/python/waterer.py", line 988, in get_kbounds
    elif qrseq[k_v_min + icheck - j_start] == glseq[k_v_min + icheck - j_start]:
IndexError: string index out of range
         5653.9
psathyrella commented 7 years ago

argggg... sorry. I guess the glseq and qrseq are different lengths? That should be impossible, but I'll add a check for that as well, and hopefully it kicks something so we can figure out what's going on.

EDIT oh, wait, no, I think I fixed it wrong yesterday. I was tired, see...

Irrationone commented 7 years ago

Wait, I think I know what the issue is. I had to move my workspace recently and although I'm running it from the right location, it looks like I hardcoded the previous executable's path. My bad, I'll let you know how it goes with the re-run.

psathyrella commented 7 years ago

no, no, wait for the new commit, my fix yesterday was garbage

psathyrella commented 7 years ago

ok, that should do what I said it would do before.

matsen commented 7 years ago

@Irrationone would you mind emailing us? We're grateful for your issue reporting and just curious who you are.

Irrationone commented 7 years ago

Thanks for the fix! Just sent an email with my info.