How do we handle multiple V/D/J calls

bcorrie commented 4 years ago

Some annotation tools provide multiple VDJ calls. How does the Stats API handle this?

bcorrie commented 4 years ago

There seem to be three options here, when considering a multiple gene call. If we consider asking for v_call stats on a repertoire that consists of one rearrangement that has v_call = IGHV5-51*01, IGHV5-51*03 I could see the following possible results.

{"field": "v_call", "data": [ { "key": "IGHV5-51*01", "count": 1}]}
- take one of the two (arbitrarily the first)

{"field": "v_call", "data": [ { "key": "IGHV5-51*01", "count": 1}, { "key": "IGHV5-51*03", "count": 1} ]}
- count both separately - means the counts will be higher than the number of rearrangements.

{"field": "v_call", "data": [ { "key": "IGHV5-51*01, IGHV5-51*03", "count": 1}]}
- count the pair as a single entity

Sooooo, which if any of these is the correct thing to do?

schristley commented 4 years ago

I think this is only an issue for rearrangements. When clonal assignment is done, a single v_call will be chosen.

Another possibility is to use the average for the count. That will insure the total count stays the same.

{"field": "v_call", "data": [ { "key": "IGHV5-51*01", "count": 0.5}, { "key": "IGHV5-51*03", "count": 0.5} ]}

bussec commented 4 years ago

Not in favor of option 2 (count both separately with normal weight) as I do not see any benefit or use case in this. Is option 1 supposed to be deterministic, i.e. the same gene call should be picked every time?

bcorrie commented 4 years ago

Is option 1 supposed to be deterministic, i.e. the same gene call should be picked every time?

I am not sure how deterministic this could be? If we assumed that the first one in the list was the one with the "highest" score (assuming such scoring occurs in the annotation tool) and the repository always counted that one, then great. But the AIRR spec doesn't really specify how such a multi-gene call should be made, so I am not sure that works.

One thing I have no clue about is whether treating it as a unit makes any real biological sense...

If a sequence is annotated the same by the annotation tool (e.g. v_call = IGHV5-51*01, IGHV5-51*03) does it make sense to count those as a unit? If the above annotation happened for 20% of the rearrangements, would the researcher want to know that...

I suspect they might, as the annotation tool is consistently confused after all. Presumably that isn't a sequencing error (if it happens 20% of the time) and that there is something in the data that is prominent but doesn't quite fit the germline database that was used? Maybe a new allele?

schristley commented 4 years ago

the first one in the list was the one with the "highest" score

There should only be multiple genes because they all have identical scores, and thus the tool cannot rank them. Instead of arbitrarily picking one, the tool is "passing the buck" and just providing all the equally possible choices.

A common reason for this to happen is the sequencing does not cover enough of the V gene to allow them to be distinguished. For analyses I've done, I show gene segment usage at the family or gene level which avoids the ambiguity.

bcorrie commented 4 years ago

So which of these look better, assuming we had one call with IGHV5-51*01 and one call with IGHV5-51*01, IGHV5-51*03:

{"field": "v_call", "total":2, "data": [ { "key": "IGHV5-51*01", "count": 1.5}, { "key": "IGHV5-51*03", "count": 0.5} ]}

Or

{"field": "v_call", "total":2, "data": [ { "key": "IGHV5-51*01", "count": 1}, { "key": "IGHV5-51*01, IGHV5-51*03", "count": 1} ]}

schristley commented 4 years ago

In the last SciTech meeting, we discussed that the stats API should probably provide the counts in raw form, and not process them too much. For example, providing an average...

Thinking on this, handling multiple calls as a single entity is fairly complicated too. What if a rearrangement has 2 gene calls, and another has 3 gene calls, where 2 gene calls are shared? Are they treated as separate entities? As Gur mentioned, some tools have a parameter which specifies the max gene calls to report, so the difference between those two rearrangements could just be how the tool was called? Plus this can create a large number of entities that the client has to sort through.

My suggestion is we avoid all that complexity and just do something simple. For each gene, we return two counts. One count is when the gene is the single lone gene call. The other count is when the gene is in a list of multiple gene calls. The client can decide if it wants to add those two numbers together, or display them separately.

Another slight alternative is the first count is when the gene is in the gene call, for both single and multiple calls. The other count stays the same. The client can subtract the one count from the other if desired. It's arithmetically similar just depends what count we consider as the "default".

If the client wants all the detail about the multiple gene calls, it can query the rearrangement data.

bzimonja commented 4 years ago

I'm just going over the full set of stats for one of our repositories, so this bubbled back in my m,ind.

If I understand correctly, each v/d/j element (let's say IGHV01-3*01) has two counts returned for a repertoire, like so:

field: "vcall_unique" data; [ (key: "IGHV01-3*01" count: 17) ]

and then later

field: "vcall_all" data:[ (key: "IGHV01-3*01" count: 27) ]

Or would it be simpler like this

"field": "v_call" "data":[ key: "IGHV01-3*01" count_unique: 17 count_shared: 27 ]

bcorrie commented 4 years ago

I think I would lean towards separating these, and maybe even consider them separate requests to the API. That is, I could ask for the following stats:

v_call: lets assume we decide the default is that the key is in the list of alleles it counts. So this returns 27
v_call_unique: this is a different stat, and we implement this explicitly as a separate gene stat. This returns 17
v_call_distributed: this gives us the ability to easily expand the gene stats API to include something more sophisticated, such as something that does the averaging that we talked about above. That is for a v_call of IGHV5-51*01, IGHV5-51*03 it gives a score of 0.5 to each.

In this way we have one default stat, with the name of the field (we just need to decide what the default is) and then we can have other stat capabilities as required.

Also, in this way, the stats response is the same for all stats, and it is just the stats request that is different. That is, the API response for v_call, v_call_unique, and v_call_distributed would be of exactly the same structure...

Thoughts?

bcorrie commented 4 years ago

The above is also nice because it would only apply to "weird" fields where we have this odd case where several stats may be desired, but for MOST fields, there would be one and only one stat. At least I think that seems like the immediate case. Of course this model would not tie us to that restriction, so if we wanted many stats on gene, or junction_length, or whatever, it is simple a controlled vocabulary "key" that needs to be changed in the API.

Of course the repository has to implement these things - but at least the API is as "clean" as possible...

schristley commented 4 years ago

That is, the API response for v_call, v_call_unique, and v_call_distributed would be of exactly the same structure.

I agree that's clean and simple. We could even do that with productive but then maybe the number of keys starts getting out of hand.

bcorrie commented 4 years ago

I think better to have that than make the responses clunky 8-)

I can see how you might get to a fair number of stats fairly quickly, but in a way that is what we want... We want it to be easy to add a capability (at least from the API perspective) for a new stat.

I think if things start to blow up for a specific stat, we are started to say we want so much flexibility for that specific stat that maybe it needs to be come an "Analysis" rather than an interactive "Stat". At that point we move the responsibility from the repository doing something interactively to an asynchronous "Analysis".

bcorrie commented 4 years ago

That is, the API response for v_call, v_call_unique, and v_call_distributed would be of exactly the same structure.

I agree that's clean and simple. We could even do that with productive but then maybe the number of keys starts getting out of hand.

I implemented this for productive and all of the current stats. (See #7)

We need to determine what the default behaviour for v_call should be as it is defined. For ease of use, I might suggest that the default behaviour might be that if the string exists in the v_call then it is counted as 1. This will mean that the sum of the counts will be larger than the total number of v_calls, but I think this is a valid stat. Whether it is the correct one for the default, I am not sure.

We can then decide if we want new stats and if so what they are - like:

v_call_unique - count the number of v_call where the call is unabiguous.
v_call_distributed - count using a distributed/average score in the sense that if there is an ambiguous call of N different alleles for a specific call, then the count is incremented by 1/N for each allele.
v_call_combined - count a v_call of IGHV1-101, IGHV1-102 as a single entity, so this call would not contribute to IGHV1-101 count or IGHV1-102 count, but would give you a count of the number of times these two appeared together.

Etc.

bzimonja commented 4 years ago

If we want one for default, I think it should be unique (so total number of v_calls including null adds up to total number of rearrangements), with ambiguous ones handled the way V-Quest does them

"IGHV1-01, or IGHV1-02"

As long as we are consistent with the order (alphabetical should do it), this will allow analysis app to post-process them the way user wants.

yyweiss commented 4 years ago

There are three types of multiple V calls.

(1) Different genes which have identical alleles. There are a number of such genes in the heavy chains of human B cells. (The IGHV3-30(-*) series are mostly responsible for the non duplicate genes.)

(2) Genes which have the same "tail", so sequences which don't start at the beginning of the V gene (by protocol design or a problematic read) cannot distinguish between such genes.

(3) Sequences with mutations, which are equally close to multiple genes.

Types 1 and 2 can also apply to clones.

From initial analysis, it seems that 2 is not that big a problem on humans, except for very short reads. (Skipping more than 200 nucleotides from the beginning.)

In any case, the default action should be the third option. (count the pair as a single entity) You can easily transform that result into the other options.

The maximum number of results shown by igblast (or other tools for calling) can affect the results. That is a bigger problem with allele counts than with gene counts.

schristley commented 4 years ago

We need to determine what the default behaviour for v_call should be as it is defined.

These seem the two simplest to start with

v_call_unique - count the number of v_call where the call is unambiguous.
v_call_exist - count the number of v_call where it exists, either single or in a list.

we can add v_call_combined, but I'm not sure it's useful from a Stats API perspective, i.e. what's the use case for visualizing a simple statistic, versus say using it for analysis? Likewise, the implementation to make sure you are counting groups correctly could get complicated.

bzimonja commented 4 years ago

Looking at #3 does that mean we'd end up with

v_call_unique v_call_unique_productive v_call_exists v_call_exists_productive

And likewise for family/gene, for all VDJC?

schristley commented 4 years ago

And likewise for family/gene, for all VDJC?

Yes.

bcorrie commented 4 years ago

Do we map one of these to v_call as the default, or, if we are naming specific stats, maybe we don't have a default - a default could be confusing if there is no obvious "best" default".

If we have a default, its behaviour needs to be clearly documented...

bcorrie commented 4 years ago

My commit auto closed this issue, which might be a bit premature. @schristley and @bzimonja can you have a look at https://github.com/ireceptor-plus/specifications/commit/fd2f09b418ef56b7da703141d5042961993c2a30 to ensure that I have the set of stats correct. I think I did it as we discussed, but good to have another set of eyes on this.

bcorrie commented 4 years ago

Once you have a look and give a thumbs up, I will close this issue.

schristley commented 4 years ago

@bcorrie Looks good, I reorganized the list so I could follow the groups easier

bcorrie commented 3 years ago

Great, closing this issue...

ireceptor-plus / specifications

How do we handle multiple V/D/J calls #2