matsengrp / sumrep

Summary statistics for repertoires
16 stars 6 forks source link

Definitions of mature and naive sequences #13

Closed BrandenOlson closed 5 years ago

BrandenOlson commented 5 years ago

sumrep operates on various levels of sequence assumptions. The first level includes "naked" input/query sequences, and the second level includes annotated, V(D)J-aligned sequences. We want to make sure that the corresponding definitions in sumrep comply with the AIRR standard and are consistent across tools.

The standard includes a sequence field which is "usually ... the unmodified input sequence". For partis, we have been using the input_seqs values as naked sequences, but by default these have "constant regions (fv/jf insertions) removed". However, we probably want these as the "mature" sequences (discussed below), so I am thinking of appending a sequence column to the partis annotation data.frame directly from the query fasta file. I'll have to make this clear in the documentation as others who are using partis externally may have to do this manually.

Furthermore, some functions in the second level (e.g. distribution of distances from naive to mature) require "mature" and "naive" sequences, denoted mature_seq and naive_seq. There are various ways to define these fields so that len(mature_seq) = len(naive_seq), including: (1) Using the query input sequence as mature_seq, and trimming or expanding the inferred germline sequence for naive_seq. However, sometimes the input sequence extends beyond the VDJ region, in which case we could restrict to the VDJ region. Additionally, not all tools readily contain such a truncated/expanded germline inference (I'm not seeing one in Change-O, e.g.). (2) Using the full inferred germline VDJ sequence for naive_seq and expanding or truncating the input sequence for mature_seq. An obvious issue here is that if the input sequence does not span the full VDJ region, then we will have to infer the gaps based on naive_seq, and will thus be underestimating the distance. (3) Restricting both sequences to the CDR3 or junction region. The same sort of issues apply, e.g., if the input sequence doesn't span the full CDR3, or if the tool doesn't return naive and mature CDR3s.

Choice 1 seems the most honest, but I'd love to hear any comments/feedback/advice before we make a commitment.

williamdlees commented 5 years ago

We could do with a general approach/policy towards sequence truncation etc.

I suggest that we always restrict to the variable region. The constant region is not well covered in sequence sets today. It would be an interesting area to explore, but a bit outside the remit, I think.

It would be useful to address partial sequences as well as full length, because many sequence sets today are less than full-length. It's worth noting that partial sequences are often 'ragged' at the 5' end, because of the use of multiple primers binding at slightly different locations. We might, therefore, want to use the IMGT alignment, and allow users to specify the region (using IMGT co-ordinates) that they want considered in the analysis. This would also provide a reasonable way of comparing repertoires with different-length reads, by specifying a range that is present in both sets.

That would lead to a modified version of (1) - use the query input sequence as mature_seq, and align and trim mature_seq and naive_seq to the region specified for analysis. Throw out any reads that don't cover the required region (providing statistics).

javh commented 5 years ago

My inclination would be to use the sequence_alignment and germline_alignment columns as defined in the AIRR Rearrangement schema (for mature_seq and naive_seq, respectively), as these are required to be aligned and length matched. There are also some optional subregion fields you could use instead (eg, v_sequence_alignment and v_germline_alignment) that are also output by IgBLAST v1.11.0.

In Change-O, the expanded/truncated observed and reference sequence alignment columns are SEQUENCE_IMGT and GERMLINE_IMGT (or GERMLINE_IMGT_D_MASK), which map to sequence_alignment and germline_alignment in the AIRR standard (the latter are not required to be aligned via IMGT numbered, just aligned in some way).

So, if the output of partis is such that mature_seq := sequence_alignment and naive_seq := germline_alignment, then I think you should be fine for partis, raw IgBLAST -outfmt 19 output and Change-O native/AIRR output. I think. You'll probably need some solid examples though.

BrandenOlson commented 5 years ago

Thanks for the input, @williamdlees and @javh . After looking at the AIRR spec and partis's manual, it seems that using mature_seq := sequence_alignment and naive_seq := germline_alignment is the sensible thing to do in terms of the standard and what is already given by igblast, partis, and Change-O. Further, the default in each case is to restrict to the variable region, so let's stick with that too.

I could add an option to specify alternative columns for the "naive" and "mature" which would allow users to have more control of the region being analyzed. Of course, this would put the burden on the user to make sensible choices, but I think it's a reasonable tradeoff given that we can't easily account for and regulate all possible scenarios.

Unless anyone objects, I'll proceed with this (and add this info to the README).

javh commented 5 years ago

I think having arguments to specific the column names for the observed sequence and inferred germline sequence is the right approach. I just wouldn't call them "naive" and "mature", because naive cells can be mature cells. I'd probably go with "sequence" and "germline" for consistency with the AIRR standard.

BrandenOlson commented 5 years ago

Okay, "naive" and "mature" have been renamed "germline" and "sequence" (here), and sumrep uses the germline_alignment and sequence_alignment column definitions by default, but allows the user to override them.