Real-world data sets - Githubissues

matsengrp / sumrep

Summary statistics for repertoires

16 stars 6 forks source link

Real-world data sets #18

Open williamdlees opened 5 years ago

williamdlees commented 5 years ago

This is a marker for the issues we discussed on the AIRR-SW-WG call today.

Some characteristics of real-world data sets:

Many data sets do not cover the entire V-region. It's quite common today for reads to be missing 10 or 15 nucleotides at the 5' end. Other data sets focus on the CDR3 and omit large portions of the V-region. Older data sets (454-based rather than illumina) contain reads of widely varying length.
Some reads in a set will be out of frame, contain N nucleotides, and will have reduced length compared to the set as a whole.

Issues:

Comparing two data sets with reads of different lengths could give results that are strongly biased by the difference in read length rather than by any innate difference in the repertoire characteristics (this is what we saw with Pezhi's first results, I think).
Even if the sets being compared have comparable read lengths, results could be impacted by the presence of short or erroneous reads.

Possible approaches, or components of an approach:

Require all sequences to be full-length and check for this by confirming presence of the CDR3 and alignment against V (and J?) genes
Allow the user to state the regions over which each comparison should be run (e.g. CDR2-FR3-CDR3). Eliminate reads from consideration if they do not have annotated sequence in each of the required regions.
Eliminate reads containing N nucleotides, stop codons. Eliminate out-of-frame reads?
Warn the user if the average read length differs in the two sets being compared.

williamdlees commented 5 years ago

There may be additional benefits from restricting analyses to specific regions. The chemical characteristics of CDRs and FRs can be quite different. It may be useful to confine an analysis to CDR3, for example, even if full-length sequences are available.

BrandenOlson commented 5 years ago

Thanks for the elaborate summary of our discussion, @williamdlees! I'll incorporate some of our email conversation here along with my proposed solution(s).

It appears that much of the issue can be solved by allowing the user to specify custom columns for sequenceColumn and germlineColumn that get passed to shazam::createSubstitutionMatrix through sumrep. In this regard, the solution reduces to those addressed in #13 . One key idea from that conversation is that, while we can try our best, it's impossible for sumrep to ensure that the user is analyzing their repertoires properly. I agree that comparing datasets with reads of different lengths could strongly bias the results, but that seems to be an issue with a user's analysis rather than a flaw with sumrep.

Having said this, I also agree that sumrep should be as explicit with the user as possible with regard to the analysis they are attempting and what assumptions sumrep is making. These functions currently filter out pairs of (germline, sequence) strings that do have matching lengths, but I can also employ the following additional filters:

Remove reads with N nucleotides
Remove reads with stop codons
Remove out-of-frame reads
Remove reads that are shorter than length K, where K can be specified by the user (we should come up with a good default, though)

My thought is to perform each of these filters by default, and tell the user that we are doing so, but allow the user to toggle each one off if they wish. The first one might be tricky as IgBlast, e.g., contains Ns in the D region by default; thus, maybe I should not remove strings with Ns as a default, but allow the option. I can output what percentage of sequences were dropped overall, as well as the difference of average read lengths (perhaps with a warning if this difference is larger than some value?). I'll be careful to document this carefully in the code as well as in the github docs.

Practically speaking, I don't see the integration of IMGT alignments with sumrep happening within the next several months (i.e., by our manuscript submission deadline). The difficulty lies in having to post-process some of the tools' outputs for AIRR-compliance, as well as my current quarter being dominated by TA duties and coursework. However, if there is enough demand for this functionality, I'm happy to put this as a feature for a subsequent release post-submission.

Does this sound like a reasonable approach? Looking forward to hearing further thoughts or suggestions.

javh commented 5 years ago

Personal bias, but I think the easiest approach for IMGT/HighV-QUEST is to just use MakeDb-imgt from Change-O with --format airr to generate the AIRR Rearrangement table.

BrandenOlson commented 5 years ago

@javh - Thanks for the tip! Would this work with any AIRR-formatted table (csv), e.g. one I supply after renaming all of partiss columns?

javh commented 5 years ago

MakeDb is a parser for IMGT/HighV-QUEST, IgBLAST and iMMune-align data, so it'll output an AIRR tsv from the raw data of those alignment tools. We don't have explicit partis support because (a) there is an option for changeo output in partis and (b) adding AIRR support is on the partis todo list (I don't know if this was finished).

Ideally, I imagine sumrep just accepting an AIRR tsv file without caring about the source. Trying to harmonize the custom output of different alignment tools is serious chore...

BrandenOlson commented 5 years ago

Ah, I see. Yeah, that's the approach I've been using for retrieving IgBlast annotations. You're correct that sumrep is able to run using only AIRR-formatted tsv files, although it doesn't yet expect all mandatory columns to be present. Harmonizing custom output is very much a chore, but a necessary evil for the purposes of the manuscript, and also for the partis and IGoR helper functions.

williamdlees commented 5 years ago

Hi Branden,

Thanks very much for the detailed reply. I wrote an initial response just now, which may have reached you by email, but subsequently deleted it because I overlooked the germline_alignment field.

Do you have a spec of which functions use 'sequence', which use 'sequence_alignment' and which use 'germline_alignment'? I'm finding it hard to propose a solution without knowing a little bit more about that.

williamdlees commented 5 years ago

Apologies for the multiple messages. I looked at the code and it seems to me that you use 'sequence_alignment' and 'germline_alignment' in functions that explicitly reference the germline, and 'sequence' in those that don't. Is that correct? Are gapped alignments accepted? Is there a particular reason not to use sequence_alignment throughout?

BrandenOlson commented 5 years ago

Hi William,

Sorry for the confusion here.

It might be helpful to revisit our spreadsheet from a while back which details each of our summary statistics and partitions them into four levels of assumption -- naked sequences, V(D)J-aligned sequences, inferred clonal families, and inferred phylogenetic trees.

So, any statistic in the first level (naked sequences) makes use of the sequence column by default. Of course, there is nothing stopping the user from providing other columns instead, such as sequence_alignment, but I believe our rationale (@matsen, please chime in if I'm missing something) is that these summaries don't really need further assumptions to have meaningful interpretation. It is also nice to have summaries that require zero annotations or inferences at all, for a quick-and-dirty look at a dataset. You could object that these statistics are highly dependent on read lengths and the nature of the sequences, but I would argue that is a benefit as it gives the user a good idea of some differences in their datasets are without the need to run an annotation tool.

sequence_alignment is used for level 2+ summaries since we have assumed an alignment for the sequences in these levels. germline_alignment is really only used in mutation-related sequences, in particular, getDistancesFromGermlineToSequence, getPositionalDistanceBetweenMutationsDistribution, getSubstitutionModel, getMutabilityModel, and getSelectionEstimate, and is always accompanied by the sequence_alignment counterpart (all of these require at least level 2 assumptions as well.)

The assumptions under which each summary is operating are included in the documentation, but I could add some more text explaining which columns are expected for each category, or even another table that maps assumptions to column defaults. Perhaps the most thorough thing to do would be include an "Expected columns" in the summary functions table; the main downside I see here is that it will be easy forget to maintain this anytime a default changes, but that shouldn't be an excuse not to do it. It's also fairly easy for the user to enter the command help(getXDistribution) to see what the default column is for the function, and I can emphasize this somewhere in the docs.

Gapped alignments are not explicitly supported and would likely not be until post-manuscript, though I'm not sure I see the benefit over using sequence_alignment as the gaps might cause parsing problems with some/all of the statistics. Happy to hear your reason(s) to support them, though.

I hope this clears up some of the confusion -- looking forward to your response!

williamdlees commented 5 years ago

Thanks Branden, that's very helpful. I see that you use the aligned sequences with some shazam functions. The mutability matrix and substitution matrix functions require IMGT-gapped alignments (by the way, the docs also warn explicitly that ambiguous nucleotides must not be present). I am not sure how well those functions would work with some other alignment, but given their nature I'd expect that they would at least require in-frame codon-based alignments. The docs for calcBaseline are not explicit but I suspect it may also be expecting an IMGT alignment.

Perhaps there is some marginal utility in being able to run the tool on unannotated raw sequences but I can't see it myself. What use is a distance comparison when the sequences compared could vary considerably in length, with some extending potentially into the UTRs and others into the constant region? Whatever benefit there might be is, to me, greatly outweighed by the corresponding difficulty today in running the length and hotspot analyses on a defined region following annotation.

A number of functions calculate only over the junction - GRAVY, Atchley and so on. I think that needs to be drawn out a little more in the documentation.

I suggest that the documentation should explicitly state that the aligned sequence and germline must be IMGT-aligned, and that analyses (such as distance calculations) that won't handle a gapped alignment are run on a de-gapped copy of the aligned sequence, rather than the raw sequence. This is a small change that would mainly affect the documentation, but it would ensure that analyses are run on a defined region - the variable region - which is the region that's really of interest for the time being.

A small enhancement, which would address the length issue, would be to allow the user to specify a mask, and to run calculations on that mask rather than on the entire aligned sequence. For example the user could say that they wanted calculations run on codons 56-65 of the IMGT-aligned sequences (which happens to correspond to IMGT CDR2). That would also allow (as a further enhancement) those calculations which are currently restricted to the junction to be run on user-defined regions, which, to me, would be a useful extension.

Putting the column dependencies into the help as opposed to higher-level documentation feels like a good way to go, particularly if they can be automatically derived from the code in some way so that they are guaranteed to be up to date. But we should also try to reduce the hard-coded dependencies on specific columns as much as possible and I've tried to suggest a way that could be done with relatively small impact on the code base. Let me know what you think.

All the best

William

matsen commented 5 years ago

Dear William--

Thank you very much for thinking about these issues.

I wanted to clarify that our original intention in taking unaligned sequences was to have some statistics that could not be biased by further processing. For example, aligned sequences can be messed up by having the wrong germline set. However, I think that you are right that utility overrides principles in this case. I personally hadn't thought about length heterogeneity, and there's no real way to deal with that without having alignments.

Branden and I chatted and he's going to look into moving into using the aligned sequences here. He may be reaching out to you for help.

Again, thanks.

williamdlees commented 5 years ago

Thanks Erick, and Branden. I can understand the concern re. bias. Very happy to help in any way I can to come up with a straightforward approach to address all these issues, which I think is achievable.

javh commented 5 years ago

I see that you use the aligned sequences with some shazam functions. The mutability matrix and substitution matrix functions require IMGT-gapped alignments (by the way, the docs also warn explicitly that ambiguous nucleotides must not be present). I am not sure how well those functions would work with some other alignment, but given their nature I'd expect that they would at least require in-frame codon-based alignments. The docs for calcBaseline are not explicit but I suspect it may also be expecting an IMGT alignment.

Just to chime in briefly here, most things in alakazam/shazam expect IMGT numbered sequences, unless they operate exclusively on the junction region (in which case they require identical length junctions).

You can get around the IMGT numbering requirement in calcBaseline by either defining a custom shazam::RegionDefinition object that defines the V, CDR and FWR boundaries in your multiple alignment or by setting the argument regionDefinition=NULL (ie, having only one region - the full sequence).

The RegionDefinition system isn't part of the 5-mer model building functions, but I think padding the multiple alignment out to 312 characters will be sufficient to get it to run without IMGT numbering. The multiple alignment requirement is really the important part. IMGT numbering just happens to be a very useful/inexpensive/reproducible way to get that alignment. But we should be able to figure out some hacks if you want to use a different multiple alignment strategy. Well, I think so anyway (haven't tested it).

javh commented 5 years ago

I may be wrong about the 5-mer functions needing a multiple alignment... A pairwise alignment between the inferred germline and observed sequence might be sufficient, as long as the non-CDR3 V segment ends at 312. I'll have to check.

BrandenOlson commented 5 years ago

Thanks for the information, @javh ! I'll address the shazam stuff in change 3 below.

In light of this along with @williamdlees 's suggestions, @matsen and I propose the following changes:

Redefining level 1 to comprise "aligned sequences", and making sequence_alignment the default column for getPairwiseDistanceDistribution, getNearestNeighborDistribution, getGCContentDistribution, getHotspotCountDistribution, and getColdspotCountDistribution, rather than sequence. I'm less inclined to require that sequences be IMGT-aligned specifically, as long as they fit the field definitions laid out in the AIRR standard. shazam functions are an exception, discussed in change 3.
Moving the assumption-less level of raw query sequences to a "level 0" which will be supported in sumrep, but not the default level for any of the summaries. Perhaps we can define some statistics for this level, such as read length distribution, but this is not a priority right now.
Modifying the shazam functions to expect IMGT-gapped sequences; I'm happy to commit to the defaults expected by shazam rather than hacking out a more general solution. I don't see columns for IMGT-gapped sequences/germlines in the AIRR docs, and it seems like shazam expects non-AIRR column names here, but I could be mistaken. Expecting sequence_imgt and germline_imgt columns seems consistent with the standard, if such fields aren't included yet. In any case, we likely need to exclude shazam functions in the manuscript analyses as it doesn't sound trivial to manually incorporate these new fields into partis and igor.
Explaining in the documentation that sumrep uses junction as the default column for amino acid related summaries, but can accept other columns that contain amino acid strings (with a cautionary note to the user, of course).
Implementing and documenting the filters I described in my first comment above.

While I understand the appeal of the position masking feature, I'm hesitant to incorporate this approach for a couple of reasons:

As the shazam summaries appear to be the only ones which require IMGT gapping, it seems unnecessary to require these columns be present for general-purpose usage. I want users without IMGT-gapped sequences to be able to run things with minimal hassle, and requiring sequence_alignment and germline_alignment fields seems like a sensible default.
For those users who do wish to restrict their analyses to certain regions, allowing them to provide custom fields sounds more straightforward, more general, and less error-prone (I'm always wary of working with string indices, especially on someone else's behalf). I could include a snippet somewhere detailing how this could be done using IMGT-gapped strings, which would illustrate the idea in a principled way.

As always, I look forward to any comments or feedback.

javh commented 5 years ago

I don't see columns for IMGT-gapped sequences/germlines in the AIRR docs, and it seems like shazam expects non-AIRR column names here, but I could be mistaken. Expecting sequence_imgt and germline_imgt columns seems consistent with the standard, if such fields aren't included yet. In any case, we likely need to exclude shazam functions in the manuscript analyses as it doesn't sound trivial to manually incorporate these new fields into partis and igor

These columns would be the sequence_alignment and germline_alignment columns in the AIRR standard. The AIRR schema doesn't specify the method by which the *_alignment sequences are aligned - it only requires that they are properly aligned in some fashion. So this could be IMGT numbering, KABAT, a pairwise alignment within rows, a multiple alignment across rows/columns, etc. (If you generate an AIRR tsv via changeo.MakeDb the sequence_alignment and germline_alignment columns will contain IMGT-gaps.)

The alakazam/shazam functions default to Change-O style column names, but they should all take an argument letting you specify the column name and use the AIRR defined names through that mechanism. If one of them doesn't have an argument for the input fields, then just drop an issue into the appropriate Bitbucket repo and someone will fix it.

williamdlees commented 5 years ago

Thanks Branden.

It would be worth specifying sequence_alignment closely, I think. For example

will you accept both - and . as gap characters (as IMGT seems to)? Will the functions (such as distance) handle gaps correctly (I guess in many cases this means removing gaps before processing)?
What region should the alignment cover? Is it always the V-region, or, if shorter reads have been used, can it be shortened?
What character should be used in a position in the alignment which is not covered by the read? To me the 'obvious' anser is a gap character, but sometimes I've seen it filled with Ns, for example after primers have been masked.

I'm not sure the particular answers to these questions matter that much but I do think it's important to document them.

As we've discussed, it's important that the sequences in both sets are compared over a span that is covered by the reads in both sets. I think it's reasonable to me that we leave it to the user to ensure this by choosing the region spanned by sequence_alignment and masking it appropriately, but it would be helpful to warn, for example by comparing average read lengths in the two alignments.

javh commented 5 years ago

As we've discussed, it's important that the sequences in both sets are compared over a span that is covered by the reads in both sets. I think it's reasonable to me that we leave it to the user to ensure this by choosing the region spanned by sequence_alignment and masking it appropriately, but it would be helpful to warn, for example by comparing average read lengths in the two alignments.

We added a few functions to alakazam to deal with these situations (eg, maskSeqEnds which creates equal end masking to avoid artifacts of low "coverage" positions, padSeqEnds, etc).

I definitely think we should be wary of feature bloat in sumrep, but some preprocessing to make sure sequences are comparable seems fair.

As for gaps, we treat ., - and N as equivalent in most cases, but that's not universal and depends on the specific analyses. I don't know that there's a one-size-fits all solution. But, unless you want to look at indels specifically, I think it's probably okay to treat them all the same.

(For indels, you might instead want to use - specifically for indels, N specifically for ambiguous base calls, codon-length in-frame ... as excluded positions, and truncated codon ./.. as equivalent to -/--.)

BrandenOlson commented 5 years ago

Ok -- after lots of helpful discussion with @matsen, @javh, and @williamdlees, I think we're ready to commit to a specification of sequence_alignment and germline_alignment:

The recommended specification of sequence_alignment will be the aligned portion of the query sequence with IMGT gaps, and constrained to the V(D)J region. Similarly, the recommended specification of germline_alignment will be the assembled, aligned, fully length inferred germline sequence spanning the same region as the sequence_alignment field, with IMGT gaps.
If IMGT gaps are not available, we will still recommend that the sequences in sequence_alignment be pairwise aligned and constrained to the V(D)J region, and that germline_alignment adhere to the same properties as above, just without the IMGT gaps.
We will allow the user to specify custom columns to any function which uses sequence_alignment and/or germline_alignment as defaults, with a warning that the results are sensible insofar as these sequences are well-defined and compared over a span that is covered by the reads in both sets.

I think treating both . and - as gaps should be fine, unless there are any heavy objections. Some functions (mainly distance-based ones) will remove all gap characters before computing the summaries, whereas others will still give the correct result with gaps (e.g. GC content, hotspots). I'll include this case-by-case in the documentation.

What character should be used in a position in the alignment which is not covered by the read? To me the 'obvious' anser is a gap character, but sometimes I've seen it filled with Ns, for example after primers have been masked.

I think gaps make the most sense in this situation. For the purposes of sumrep, positions not covered by the read will not impact any of the summaries, so these positions are only of interest if the user wants to mask sequences to a particular region. Masking is still something sumrep won't do for the user, although I can include a remark/example somewhere on how this can be done.

it would be helpful to warn, for example by comparing average read lengths in the two alignments

This is a great idea, although I'll put it on the back-burner for now. (I'll file an issue to make sure I don't forget about it!)

Looking forward to any feedback!

javh commented 5 years ago

Sounds good to me. The only reason to treat -and . differently is if you want to count indels as mutations, which I don't think you want to do in sumrep (correct me if I'm wrong). In that case, you'd need a preprocessing step to replace non-codon/out-of-frame . characters with an appropriate number of - characters. Eg:

GERMLINE>   ATG...TGCT
SEQUENCE>   ATG....GCT
VALID-GERM> ATGTGCT
VALID-SEQ>  ATG-GCT

VALID being what you actually analyze.

If this is out of score for sumrep, which I think it is, then I think ., - and N all being equivalent is a good idea.

williamdlees commented 5 years ago

Thanks Branden, your outlined approach sounds good to me.

William