ireceptor-plus / specifications

Central repository for all iReceptor+ specifications

Does Stats API require repertoire_id, data_processing_id, and sample_processing_id? #1

Closed bcorrie closed 4 years ago

bcorrie commented 4 years ago

Started from a discussion around the iReceptor implementation of the Stats API. The Stats API currently has all three provided to the API:

https://github.com/ireceptor-plus/specifications/blob/ae9b69a190e5e4b0141acadc55a88b202c9ad231/stats-api.yaml#L219

Are all three required... Please discuss 8-)

bcorrie commented 4 years ago

I believe that we need all three, as there are some cases where you can't link a set of rearrangements that have been processed in a specific way without providing all three _ids, but worth revisiting since our Stats API works at the Repertoire level.

schristley commented 4 years ago

My initial understanding is that the API should return stats on "all of the rearrangements for the repertoire", but this needs to be defined more precisely, preferably with an exact query. So one definition is:

{
  "filters": {
    "op": "=",
    "content": {
      "field": "repertoire_id",
      "value": "rep123"
    }
  }
}

but this mixes rearrangements from different data processing pipelines. As far as I know, there is no biological use case to mix rearrangements across data processing, compare yes, but not combine or mix. That implies we should also restrict by the data_processing_id

{
  "filters": {
    "op": "and",
    "content": [
      {
        "op": "=",
        "content": {
           "field": "repertoire_id",
           "value": "rep123"
        }
      },
      {
        "op": "=",
        "content": {
           "field": "data_processing_id",
           "value": "data123"
        }
      }
    ]
  }
}

Now this gives us all the rearrangements that were processed together, and is the basis for doing analysis on a repertoire.

So if we stop at "all of the rearrangements for the repertoire", then we are done. However, it might be useful to provide statistics on subsets of all rearrangements. In fact, most scientific use cases will require some additional filtering. For example, getting the gene usage distribution on both the productive and non-productive combined does not really make scientific sense. The basic use case is to get the gene usage for just the productive rearrangements. That would suggest we allow a filter on productive.
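As a sketch of what that could look like, a productive-only restriction can be layered onto the repertoire/data-processing query in the same filter syntax (repertoire_id, data_processing_id, and productive are standard AIRR Rearrangement fields; the id values are placeholders):

```json
{
  "filters": {
    "op": "and",
    "content": [
      { "op": "=", "content": { "field": "repertoire_id", "value": "rep123" } },
      { "op": "=", "content": { "field": "data_processing_id", "value": "data123" } },
      { "op": "=", "content": { "field": "productive", "value": true } }
    ]
  }
}
```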

There are also other filters that might be useful. For example, if you want statistics on just one sequencing file (and/or paired-end files), that corresponds to sample_processing_id.

(Note, digression: the word "sample" is overloaded, sample_processing_id might be better to be called raw_sequence_data_id because it represents the data in a single sequencing file (and/or paired-end files) according to the AIRR RawSequenceData object in SampleProcessing. In contrast biologists/clinicians often talk about comparing "samples", analyzing "samples" and so forth. That "sample" concept has been "renamed" by AIRR to Repertoire, because biologists/clinicians don't care about specific sequencing files, they care about analysis and comparison of that higher-level concept. The AIRR Repertoire allows multiple sequencing files/run and experimental protocols to be combined together and considered a consistent whole, while both hiding but keeping provenance about the underlying technicalities. end digression).

So it might be useful to allow sample_processing_id to be a filter. The only issue is that sample_processing_id is an optional field, so not all data sets may honor it. If the researcher really wanted to compare statistics between two (or more) sequencing files, they should create additional repertoires for them. That will allow them to be compared and analyzed without worrying if sample_processing_id is properly supported by the data set (and/or the data processing that was performed).

Another possible filter is locus, so when you ask for v_call, you can get (say) just TRB without also getting TRA, and so on.
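Such a locus restriction would be one more clause in the same filter syntax (a sketch; TRB is just an example value):

```json
{
  "op": "=",
  "content": {
    "field": "locus",
    "value": "TRB"
  }
}
```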

But we should think about how to do filters, do we customize a set for each endpoint? Do we allow a basic set of filters across all endpoints? Should we only define filters on required fields so that we know there will be reasonable values in that field? Maybe there are some optional fields that are useful too as filters.

bcorrie commented 4 years ago

Another possible filter is locus, so when you ask for v_call, you can get (say) just TRB without also getting TRA, and so on.

Our initial plan for the Stats API at the repository level was to not provide for the ability to filter for sequence/rearrangement features such as productive and locus. My understanding of what we wanted to do (see the Google Doc discussion) was to have this fast and interactive, likely caching the stats so that the response would almost be immediate.

So the filters in the current API are the _id fields that are used to link rearrangements to repertoire objects.

schristley commented 4 years ago

Our initial plan for the Stats API at the repository level was to not provide for the ability to filter for sequence/rearrangement features such as productive and locus. My understanding of what we wanted to do (see the Google Doc discussion) was to have this fast and interactive, likely caching the stats so that the response would almost be immediate.

I understand that. I was just presenting alternatives that would make the API more scientifically useful. As far as I'm concerned there's really no point in providing a statistic like gene usage on a combination of productive and non-productive rearrangements, so I definitely would not present that to a VDJServer user for example as it would just be confusing and meaningless. I'd have to figure out another way.

So the filters in the current API are the _id fields that are used to link rearrangements to repertoire objects.

Does this include cell_id, clone_id and etc?

bcorrie commented 4 years ago

So the filters in the current API are the _id fields that are used to link rearrangements to repertoire objects.

Does this include cell_id, clone_id and etc?

My understanding was that we were going to ask for Stats for either rearrangements, clones, or cells separately - that is, there are stats entry points for each of them, and for each you would provide the _id fields for those entities that link them. See the SciTech decision from 2019-10-16. We would have things like:

/rearrangements/gene_usage
/clones/gene_usage
/cell/gene_usage

Each of these would have as query parameters the _id's that link repertoires to these "observed" objects.

schristley commented 4 years ago

I believe that we need all three, as there are some cases where you can't link a set of rearrangements that have been processed in a specific way without providing all three _ids, but worth revisiting since our Stats API works at the Repertoire level.

So this gets us back to what I was trying to understand in the original ticket. The vague "we need all three" doesn't precisely define what rearrangements need to be operated upon to define the statistic. Is it all possible combinations of those three _ids? Some combinations? Which combinations do we have to support? You don't seem to think my definition for "all the rearrangements for the repertoire" is sufficient, but I don't know your definition so I need you to precisely define that in the context of this API, preferably with queries so I know exactly what data to calculate upon.

bcorrie commented 4 years ago

I believe there are cases when you are going to want to provide a statistic selecting rearrangements using one (repertoire_id), two (repertoire_id, and one of data_processing_id OR sample_processing_id), or three (all of them) of the _id fields. So I would say we need to support all combinations... That is currently how the draft spec in the WP2 github was worded, including some descriptions which I have changed a bit in the current spec.

Have a look at the wording around the ID fields for the API parameters and see if they make sense.

One question I had is whether we should use the primary DataProcessing if no data processing is provided, or if we should consider that to be the case when you want to perform combined stats on all of the DataProcessings in that Repertoire? In a way, that doesn't really make sense to me. If you had a data set that was annotated with mixcr and igblast (two DataProcessing objects), would you ever want to do a combined V Gene usage analysis across the annotations from both?

schristley commented 4 years ago

I'm (roughly) thinking of the combinations this way; each identifier has three possibilities:

  1. identifier is not sent in the request.
  2. identifier is sent in the request with an actual value.
  3. identifier is sent in the request with a null value.

With 3 possibilities and 3 identifiers, that gives 3 × 3 × 3 = 27 combinations. For the sake of argument, let's say those possibilities imply these types of queries.

  1. queries all possible values for that identifier.
  2. queries just that specific identifier value.
  3. queries just records where that identifier is null (i.e. "is missing" operator)
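For illustration, case (3) maps onto the ADC API's "is missing" operator; a sketch, using sample_processing_id as the identifier:

```json
{
  "filters": {
    "op": "is missing",
    "content": {
      "field": "sample_processing_id"
    }
  }
}
```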

For repertoire_id, (1) and (3) don't make (scientific) sense for most statistics. The exception is (1) for /count endpoints, as that's not really an analysis statistic, just an interesting number to provide. If you agree, then this implies repertoire_id in the request should be required and nullable:false.

That reduces the combinations down to 1 × 3 × 3 = 9 combinations.

For data_processing_id, (1) and (3) also don't make (scientific) sense for most statistics. So to answer your question

would you ever want to do a combined V Gene usage analysis across the annotations from both?

My answer is no. That's why I stated before

As far as I know, there is no biological use case to mix rearrangements across data processing, compare yes, but not combine or mix.

However, at the same time, data_processing_id is a funky case because we have custom semantics associated with it. I just detailed this in a ticket and I think we should get rid of it. In particular for the Stats API, I don't think we should adopt those semantics but instead treat it just like any other field and not try to interpret the "primary annotation".

If you agree, then this implies data_processing_id in the request should be required and nullable:false. This gets us to what I was trying to convey in the comment above as "all of the rearrangements for the repertoire".

That reduces the combinations down to 1 × 1 × 3 = 3 combinations.

Lastly, for sample_processing_id, (1) makes scientific sense but (2) and (3) are questionable, and this, I think, is where we've been talking past each other. So I'm okay with supporting (2), but just don't call that a "repertoire"; to be more precise, call it a "sample processing subset of a repertoire" or something like that. It is a subset: specifically, you are asking for statistics for a single sequencing run. So yes, I can see how you might want to look at the gene usage for one sequencing run and compare it with another sequencing run, but I feel that is abusing the notion somewhat... If you really want to do that, define each sequencing run as a separate repertoire, then compare repertoires, because repertoires are the unit of analysis, not sample processing records. But fine, these are just statistics, I can live with (2) if you want.

Though, this is partially why I feel your statement

to not provide for the ability to filter for sequence/rearrangement features.

isn't completely right. With sample_processing_id, that's essentially what we are doing. We are filtering the repertoire for only rearrangements that are part of a single sequencing run. So if we can do that filtering, why can't we do others? Anyways we can debate that later.

But there's more, just like the infomercial ;-D. sample_processing_id is the odd ball because we cannot guarantee that all data processings can assign a sample_processing_id to their rearrangements. This is also what makes (2) and (3) questionable. Let's detail the two scenarios: (S1) the data processing does assign a sample_processing_id to each rearrangement, and (S2) it does not.

If you do (2) on a repertoire for (S2), the statistics will come back empty. Yet if you do (2) on a repertoire for (S1), you will get statistics. How is the client to know whether they can use sample_processing_id or not to get statistics? They have to somehow know if the data processing supports it or not, i.e. if (S1) or (S2).

Those scenarios also make (3) ambiguous, because for (S2) this will return all the rearrangements, which is essentially the same as (1). But for (S1), (3) will likely return no rearrangements because all of them have sample_processing_ids associated with them. So again, how is the client to know?

This is why I think (2) and (3) are questionable; I'm okay with (2) if you want, but (3) seems worthless. Just know that when you work on the iR+ gateway, you'll need to handle (S1) and (S2) in a way that doesn't confuse the user.

So that leaves us with either 1 or 2 combinations to support. If 1 combination, then we eliminate sample_processing_id from the interface. If 2 combinations, then sample_processing_id is not required and nullable:false
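Under that second option, a minimal valid request would carry the two required identifiers with concrete values, and sample_processing_id only when subsetting; a sketch of the parameter payload (the shape and values are assumptions, not the finalized spec):

```json
{
  "repertoire_id": "rep123",
  "data_processing_id": "data123",
  "sample_processing_id": "sample123"
}
```

Omitting sample_processing_id would then request statistics over all of the rearrangements for that repertoire and data processing.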

bcorrie commented 4 years ago

to not provide for the ability to filter for sequence/rearrangement features.

isn't completely right. With sample_processing_id, that's essentially what we are doing. We are filtering the repertoire for only rearrangements that are part of a single sequencing run. So if we can do that filtering, why can't we do others? Anyways we can debate that later.

sample_processing_id is an ID field that links a Rearrangement back to the Repertoire level object. It isn't a sequence/rearrangement feature that is assigned by the annotation tool. So it is fundamentally different than say productive or v_call. I think the Stats API needs to support filtering of some kind on all of the linking ID fields between Rearrangement and Repertoire. This holds true for both sample_processing_id and data_processing_id.

I think we might need to consider filtering on some rearrangement level features like productive (see #3) but I think that decision is fundamentally different than filtering on an ID linking field.

bcorrie commented 4 years ago

However, at the same time, data_processing_id is a funky case because we have custom semantics associated with it. I just detailed this in a ticket and I think we should get rid of it. In particular for the Stats API, I don't think we should adopt those semantics but instead treat it just like any other field and not try to interpret the "primary annotation".

Yep, no objection there - at least for special interpretation in the Stats API - that is confusing. I think I agree with your suggestion in the AIRR ticket as well, but I need to think about that a bit more. 8-)

schristley commented 4 years ago

sample_processing_id is an ID field that links a Rearrangement back to the Repertoire level object. It isn't a sequence/rearrangement feature that is assigned by the annotation tool. So it is fundamentally different than say productive or v_call.

Though, it's not a guaranteed link like repertoire_id and data_processing_id, which I think puts it in its own special category, but I'm with you, it's conceptually different from productive even if technically the query operates the same.

bcorrie commented 4 years ago

Lastly, for sample_processing_id, (1) makes scientific sense but (2) and (3) are questionable, and this, I think, is where we've been talking past each other. So I'm okay with supporting (2), but just don't call that a "repertoire"; to be more precise, call it a "sample processing subset of a repertoire" or something like that.

I think this is very much the same as data_processing_id. When you filter with data_processing_id you are getting a "data processing subset of a repertoire", just like when you filter with sample_processing_id you are getting a "sample processing subset of a repertoire".

It is a subset, specifically you are asking for statistics for a single sequencing run.

The sample_processing_id isn't just to differentiate sequencing runs. Remember that if a database stores a single repertoire that has been cell sorted so that the repertoire has two different types of T cells, you will have two different SampleProcessing objects and therefore two different sample_processing_ids. In this case, I think it would be common to want to be able to perform stats and compare the V Gene usage between these two different types of T Cells, which are once again "sample processing subsets of a repertoire". So you need your 2) above.

schristley commented 4 years ago

Lastly, for sample_processing_id, (1) makes scientific sense but (2) and (3) are questionable, and this, I think, is where we've been talking past each other. So I'm okay with supporting (2), but just don't call that a "repertoire"; to be more precise, call it a "sample processing subset of a repertoire" or something like that.

I think this is very much the same as data_processing_id. When you filter with data_processing_id you are getting a "data processing subset of a repertoire", just like when you filter with sample_processing_id you are getting a "sample processing subset of a repertoire".

The syntax is the same but not the semantics. There is no scientific reason I can see to combine data across two or more data_processing_ids. That's combining Mixcr and Igblast annotations together; it's technically possible but not semantically useful. On the other hand, you are supposed to combine data across all of the sample_processing_ids in a repertoire (for a single data processing). That's one of the primary purposes of the repertoire (and why sample is an array versus being a single object): it allows multiple samples' data to be combined together and considered as a single unit for analysis.

So yes, the syntax of them being "subsets" of a larger set is correct, but the semantics of those subsets are very different.

schristley commented 4 years ago

It is a subset, specifically you are asking for statistics for a single sequencing run.

The sample_processing_id isn't just to differentiate sequencing runs. Remember that if a database stores a single repertoire that has been cell sorted so that the repertoire has two different types of T cells, you will have two different SampleProcessing objects and therefore two different sample_processing_ids. In this case, I think it would be common to want to be able to perform stats and compare the V Gene usage between these two different types of T Cells, which are once again "sample processing subsets of a repertoire". So you need your 2) above.

Sure, fine, as I say I'm okay with supporting 2). I would just like to elaborate on your example... Why did the researcher put those two samples together into a single repertoire, unless they wanted them to be combined and analyzed together? That is the purpose of the repertoire object, it's to combine samples to be treated as a single unit for analysis, ala the good ole abstraction layer concept from computer science.

So you talk about not encouraging researchers to do the wrong thing. I would encourage researchers not to do what you just describe unless they really intend those samples, and thus different T cell types, to be mixed together. Now there are good scientific reasons to do that. At the same time, I would expect there to be two additional repertoires in the study, each of which holds a single T cell subset individually. So possibly 3 repertoires, one with mixed T cell subsets, and two with the individual T cell subsets. This allows repertoire comparison to be performed, exactly like what you say: "compare the V Gene usage between these two different types of T Cells". We should not encourage researchers to design their repertoires in the wrong way, thinking that this is the "proper" way to do comparison, because it isn't. The repertoire is the unit of analysis, and when we say "compare repertoires", we want researchers to think repertoires and not "samples within a repertoire". Go down that path too far and you start breaking the whole repertoire concept in the first place.

schristley commented 4 years ago

@bcorrie It might be worthwhile to review some discussion from when we originally added sample_processing_id. One intent, having an id for the sample processing records in Repertoire, is useful and had support (jason, chaim).

However, the other intent, having the mapping between a rearrangement and a sample processing record, isn't guaranteed. You seemed to indicate that this wasn't actually your intent, though maybe your statement is more proper.

First, I will say it is trivially possible if the researcher does it... They don't have to, but this at least makes it possible. In our current spec, this is not possible.

So I can understand sample_processing_id in the API, as you say with this statement:

the repertoire has two different types of T cells, you will have two different SampleProcessing objects and therefore two different sample_processing_ids. In this case, I think it would be common to want to be able to perform stats and compare the V Gene usage between these two different types of T Cells, which are once again "sample processing subsets of a repertoire". So you need your 2) above.

But you won't be able to guarantee this for all studies because this requires that the rearrangements have a sample_processing_id attached (as you know, "they don't have to"). Are you okay with that? Do you intend to write code in the gateway UI to handle the different possibilities of when you can or cannot do this? If you are, that's great, I was just trying to make sure that you understood this so there wasn't a "gotcha" moment later, when other repositories return different results than what you might want or expect.

bcorrie commented 4 years ago

@schristley yes, I understand that not all rearrangements will have a sample_processing_id. Indeed all of repertoire_id, sample_processing_id, and data_processing_id are nullable and none are required at the rearrangement level. So it is possible to have rearrangements without any of these _id fields.

But the flip side is also true. If, in my data, I do want to be able to associate a given rearrangement to a specific repertoire, and within that repertoire either a specific sample_processing_id or data_processing_id, then I need these fields at the rearrangement level. Indeed, if you have Repertoires that have different SampleProcessings and you want to differentiate these at the Rearrangement level, you need to have a sample_processing_id in your Rearrangement data. If you have multiple SampleProcessings and you don't want to differentiate, then that is fine too. sample_processing_id may or may not be there for a given Rearrangement, as you say. The curator of the data decides how they create the Repertoire and their internal SampleProcessing and DataProcessing structures. The Gateway/UI certainly does need to take into account these complexities and all of the possibilities.

Similarly for Stats, if you want stats on a Repertoire that has two SampleProcessings, one a B-cell subset and one a T-cell subset, you are definitely going to want to be able to say "I want to do V gene usage on one of the cell subsets" (either the B-cell or the T-cell subset), as I don't think it really makes sense to do V gene usage with IG and TR genes combined. If the data curator that stores the data chooses to make that impossible by using a single Repertoire and not providing a sample_processing_id to differentiate between the two, then that is fine too (although I would challenge the value of their Rearrangement-to-Repertoire mapping if they did this when the two cell subsets were B-cell and T-cell data). If it is just T-cell data, and you as the study curator don't want to provide the ability to differentiate a rearrangement between the two types of T-cells that come from the two different SampleProcessings, that is valid also.

I think what we want to ensure is that if data in a repository is curated with a Repertoire with multiple SampleProcessings AND the curator provides sample_processing_id in their Repertoire and Rearrangement data to differentiate the two (let's say both a B-cell and a T-cell SampleProcessing), we need the Stats API to be able to say the equivalent of "I want V gene usage of the B-cell data only". So we need to be able to support queries on sample_processing_id.
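The "B-cell data only" request could then look something like this, assuming the draft Stats API takes the linking _id fields in a repertoires array as sketched elsewhere in this thread (all ids are placeholders):

```json
{
  "repertoires": [
    {
      "repertoire_id": "rep123",
      "data_processing_id": "data123",
      "sample_processing_id": "sample_bcell"
    }
  ]
}
```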

bcorrie commented 4 years ago

I am pretty sure this is an example of why we need https://github.com/airr-community/airr-standards/issues/441 but diving in anyway... 8-)

Sure, fine, as I say I'm okay with supporting 2). I would just like to elaborate on your example... Why did the researcher put those two samples together into a single repertoire, unless they wanted them to be combined and analyzed together? That is the purpose of the repertoire object, it's to combine samples to be treated as a single unit for analysis, ala the good ole abstraction layer concept from computer science.

Correct me if I am wrong, but the biology concept of a Repertoire, and the two-second elevator pitch description from an AIRR perspective, is that a Repertoire is the set of T-cells and B-cells in a single individual at a specific time point. It is designed to be flexible for use, but that is my understanding of the "broadest" level of abstraction that a Repertoire can contain. That is, it is incorrect to have data from multiple subjects in a single Repertoire, and it is incorrect to have data from multiple time points in a single Repertoire (in reading our documentation, I am less sure about the time point dimension - I thought this was the case, but it doesn't seem to be reflected anywhere).

One perfectly valid use for a Repertoire that might be used to describe a Study would be to group all B-cell and T-cell subsets for each individual at a specific time point into a single Repertoire. In this case, the data curator for the study is grouping them in the way that is the most logical for describing the Study. Certainly this is study focused and not analysis focused, but it is actually a common way in which I would expect the AIRR Standards to be used to describe a study. Certainly this is the orientation we take when we curate data from studies and load it into iReceptor repositories. We don't actually store hierarchical Repertoires (we flatten them; each Repertoire has one SampleProcessing and one DataProcessing), but that is logically how we think of them from understanding the Study and its structure and intent.

So you talk about not encouraging researchers to do the wrong thing. I would encourage researchers not to do what you just describe unless they really intend those samples, and thus different T cell types, to be mixed together.

As I say above, the Repertoires above are describing the logical entities in the Study, not the logical entities in a downstream analysis. From a curator perspective, trying to get someone to understand the structure of my study, I really do intend for these to be grouped together. So this is the Repertoire structure that I might store in a repository to capture my study structure... You might create a totally different Repertoire structure for your analysis, absolutely, but that doesn't necessarily mean that the study-focused repertoire structure is incorrect, does it?

schristley commented 4 years ago

So it is possible to have rearrangements without any of these _id fields.

Hopefully not in an ADC API repository!

schristley commented 4 years ago

Correct me if I am wrong, but the biology concept of a Repertoire, and the two-second elevator pitch description from an AIRR perspective, is that a Repertoire is the set of T-cells and B-cells in a single individual at a specific time point.

That's one definition for a biological repertoire, but not the only definition. Biologists tend to have their own definitions of a repertoire. This was the original problem we side-stepped with Repertoire: instead of trying to define a single universal biological repertoire definition that everyone would agree upon, we put in flexibility. We essentially state that in the 3rd paragraph when introducing Repertoire.

As I say above, the Repertoires above are describing the logical entities in the Study, not the logical entities in a downstream analysis. From a curator perspective, trying to get someone to understand the structure of my study, I really do intend for these to be grouped together. So this is the Repertoire structure that I might store in a repository to capture my study structure... You might create a totally different Repertoire structure for your analysis, absolutely, but that doesn't necessarily mean that the study-focused repertoire structure is incorrect, does it?

This is certainly where some of the confusion lies, and why https://github.com/airr-community/airr-standards/issues/441 is important. Repertoire, as in the schema object, not the biological concept, was designed for analysis, not curation. We say that in the first paragraph, unit of analysis, not curation record.

Now I can totally understand if you are coming from a curation standpoint, and don't do analysis, that you may interpret the Repertoire object to be a curated description of the Study. That's a flaw in the concepts, or a flaw in the documentation, or all of the above. Curation of the study is actually the Study, Subject, ..., DataProcessing objects. Those are the core MiAIRR categories. The Repertoire is a convenience object, which is not part of the MiAIRR standard, that "gathers all of this information together into a composite object, which can be easily accessed by computer programs".

Each Repertoire is meant to be an (essentially) indivisible set of rearrangements (with the exception of different data processing) which an analysis tool operates upon, and the Repertoire data structure is a simple way (which is why it's denormalized) for the analysis tool to access the study metadata while it's doing some analysis on those rearrangements.

So this drives home another point in https://github.com/airr-community/airr-standards/issues/441 in that Repertoire is trying to do too much. It cannot be both a curation record and a convenience object for analysis tools to access study metadata. It's poorly structured and has too much duplicated data for a curation record.

So if you curate a study, and you put all of the samples for a subject into a single Repertoire, which is totally understandable if you are thinking of Repertoire as a curation record, then you are in essence saying "all these samples are a single unit to be analyzed together", i.e. the indivisible set of rearrangements.

Maybe, in an optimal world, when you curate a study, you wouldn't create any Repertoire objects at all. You would just create the normal MiAIRR objects. However, you (as in the IPA data repository) are doing more than just curation. You are also data processing the raw data into Rearrangements, so that requires you to define Repertoire objects which describe that analysis and the set of Rearrangements that are part of that repertoire. In this, you become the researcher defining what the biological repertoire of interest is. The fact that you are flattening the repertoires means you might be defining repertoires differently from the study authors.

bcorrie commented 4 years ago

This is certainly where some of the confusion lies, and why airr-community/airr-standards#441 is important. Repertoire, as in the schema object, not the biological concept, was designed for analysis, not curation. We say that in the first paragraph, unit of analysis, not curation record.

I am not sure I agree with that... Repertoire was designed to do both curation (it is the basis for /repertoire in the query API after all) and analysis. In fact, I would argue that our focus has been primarily on curating (describing studies AKA MiAIRR), storing (file formats and repositories), and querying (the ADC API) and less on analysis to date. Repertoire is currently fundamental to the ADC API which is curation and retrieval focused.

I suspect this discussion should be moved to the AIRR issue above and we should try and wrap up this discussion here.

Based on your thumbs up above in https://github.com/ireceptor-plus/specifications/issues/1#issuecomment-670705407, can we close off this issue on whether we need the three _id fields for the Stats API and move the AIRR Repertoire discussion to the AIRR issue 8-)

schristley commented 4 years ago

I am not sure I agree with that...

LOL, I designed the Repertoire object! I wrote the description of it. I certainly know why I designed it in a specific way! The fact that others re-interpret my design for their own purposes is something I cannot prevent... ;-D

Based on your thumbs up above in #1 (comment), can we close off this issue on whether we need the three _id fields for the Stats API and move the AIRR Repertoire discussion to the AIRR issue 8-)

Yeah, the only thing I think is missing are the appropriate flags that I detailed above.

repertoire_id in the request should be required and nullable:false

data_processing_id in the request should be required and nullable:false

sample_processing_id is not required and nullable:false

Though maybe these are different if we intend to interpret them differently.

bcorrie commented 4 years ago

So it is possible to have rearrangements without any of these _id fields.

Hopefully not in an ADC API repository!

Yeah, it would be a repository of basic sequence annotations with no Repertoire information, but I believe it would still be AIRR-compliant according to the spec.

bcorrie commented 4 years ago

Yeah, the only thing I think is missing are the appropriate flags that I detailed above.

repertoire_id in the request should be required and nullable:false

data_processing_id in the request should be required and nullable:false

sample_processing_id is not required and nullable:false

It seems like it might be useful to be able to do the following:

{
  "token": "string",
  "repertoires": [ { "repertoire_id": "REPID_4598" }, { "repertoire_id": "REPID_4598" } ],
  "fields": [ "v_subgroup", "v_gene", "d_call", "c_call" ]
}

We can't do this if data_processing_id is required. I would suggest having the same semantics for data_processing_id and sample_processing_id, that is, required = false and nullable = false. In other words, if you provide the field it needs a value, and if you don't provide it that is fine: you get everything from the Repertoire.
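The "required = false, nullable = false" rule above can be sketched as a small request validator. This is a hypothetical illustration, not part of any iReceptor+ reference implementation; the field names follow the Stats API draft, and `REPID_4599` is a made-up second repertoire id.

```python
# Illustrative validator for the proposed flags: repertoire_id is required and
# non-null; the processing ids may be omitted, but may not be sent as null.
import json

OPTIONAL_NON_NULLABLE = ("data_processing_id", "sample_processing_id")

def validate_stats_request(body):
    """Return a list of validation errors for one stats request body."""
    errors = []
    for rep in body.get("repertoires", []):
        # repertoire_id: required and nullable:false
        if rep.get("repertoire_id") is None:
            errors.append("repertoire_id is required and nullable:false")
        # processing ids: optional, but if present must carry a value
        for field in OPTIONAL_NON_NULLABLE:
            if field in rep and rep[field] is None:
                errors.append(f"{field} is nullable:false; omit the field instead of sending null")
    return errors

request = json.loads("""
{
  "repertoires": [
    { "repertoire_id": "REPID_4598" },
    { "repertoire_id": "REPID_4599", "data_processing_id": null }
  ],
  "fields": [ "v_subgroup", "v_gene" ]
}
""")
print(validate_stats_request(request))
```

Under these semantics the first repertoire entry passes while the second is rejected for carrying an explicit null.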

schristley commented 4 years ago

Yeah, the only thing I think is missing are the appropriate flags that I detailed above.

repertoire_id in the request should be required and nullable:false

data_processing_id in the request should be required and nullable:false

sample_processing_id is not required and nullable:false

It seems like it might be useful to be able to do the following:

{
  "token": "string",
  "repertoires": [ { "repertoire_id": "REPID_4598" }, { "repertoire_id": "REPID_4598" } ],
  "fields": [ "v_subgroup", "v_gene", "d_call", "c_call" ]
}

We can't do this if data_processing_id is required. I would suggest having the same semantics for data_processing_id and sample_processing_id, that is, required = false and nullable = false. In other words, if you provide the field it needs a value, and if you don't provide it that is fine: you get everything from the Repertoire.

How is it useful? What is the scientific reason/use case to combine gene calls across different data_processing?

bcorrie commented 4 years ago

How is it useful? What is the scientific reason/use case to combine gene calls across different data_processing?

It is an ease of use "usefulness". If it is required, then before one can do a Stats API query on a repertoire, you have to query the repertoire, get all of its data_processing_ids (if it has any), and then choose one, and ask for the Stats. This is a bit of a pain, in particular because 99% of our current data has only one DataProcessing (admittedly that will likely change).

I understand where you are coming from about why it would probably be bad, rather than just confusing, to combine the data from two different data_processing_ids in a single stats request, but this confusion/ambiguity doesn't really exist if we use primary_annotation. If nothing is specified, then you return the primary_annotation. If there isn't one and there is more than one annotation, then you would generate an error. These were the semantics I had in the early version of the Stats API here: https://github.com/ireceptor-plus/WP2/blob/437e0a7111ce7f6179d09af89c8188cec5e4718d/specs/stats-api.yaml#L85. In this case, you could just send a repertoire_id.

The main reason there is ambiguity around this, as you describe above, is that we are talking about taking away primary_annotation in the AIRR Standard: https://github.com/airr-community/airr-standards/issues/434. If we have a primary_annotation this isn't a problem, and it is queries like this one that justify having a primary_annotation in the first place. 8-)
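The primary_annotation fallback described above can be sketched server-side. The Repertoire structure and the `primary_annotation` flag follow the AIRR schema; the resolver function itself is a hypothetical repository internal, and the ids `rep123`, `dp1`, `dp2` are made up.

```python
# Illustrative resolution of a stats request's data_processing_id:
# an explicit id wins; a missing id falls back to the sole DataProcessing,
# then to the one flagged primary_annotation; otherwise it is an error.

def resolve_data_processing(repertoire, requested_id):
    processings = repertoire.get("data_processing", [])
    if requested_id is not None:
        return requested_id                      # caller chose explicitly
    if len(processings) == 1:
        return processings[0]["data_processing_id"]
    primary = [p for p in processings if p.get("primary_annotation")]
    if len(primary) == 1:
        return primary[0]["data_processing_id"]
    # more than one DataProcessing and no unambiguous primary
    raise ValueError("ambiguous data_processing; specify data_processing_id")

rep = {
    "repertoire_id": "rep123",
    "data_processing": [
        {"data_processing_id": "dp1", "primary_annotation": True},
        {"data_processing_id": "dp2", "primary_annotation": False},
    ],
}
print(resolve_data_processing(rep, None))  # falls back to the primary: dp1
```

With these semantics a request carrying only a repertoire_id still yields a well-defined answer, and the ambiguity only surfaces as an error when no primary exists.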

schristley commented 4 years ago

How is it useful? What is the scientific reason/use case to combine gene calls across different data_processing?

It is an ease of use "usefulness". If it is required, then before one can do a Stats API query on a repertoire, you have to query the repertoire, get all of its data_processing_ids (if it has any), and then choose one, and ask for the Stats. This is a bit of a pain, in particular because 99% of our current data has only one DataProcessing (admittedly that will likely change).

I understand where you are coming from about why it would probably be bad, rather than just confusing, to combine the data from two different data_processing_ids in a single stats request, but this confusion/ambiguity doesn't really exist if we use primary_annotation. If nothing is specified, then you return the primary_annotation. If there isn't one and there is more than one annotation, then you would generate an error. These were the semantics I had in the early version of the Stats API here: https://github.com/ireceptor-plus/WP2/blob/437e0a7111ce7f6179d09af89c8188cec5e4718d/specs/stats-api.yaml#L85. In this case, you could just send a repertoire_id.

The main reason there is ambiguity around this, as you describe above, is that we are talking about taking away primary_annotation in the AIRR Standard: airr-community/airr-standards#434. If we have a primary_annotation this isn't a problem, and it is queries like this one that justify having a primary_annotation in the first place. 8-)

Hmm, I see, maybe ;-D I was actually hoping to not propagate this special handling of data_processing_id. Though I'm confused on what you are suggesting, do you want to keep this special handling? That is, no data_processing_id means the API service has to find and pick the primary one and return stats for it, or are you suggesting that stats combines all data processing? I guess I was assuming the latter but maybe you mean the former ... which at least is ok scientifically.

I'm not suggesting to get rid of the primary_annotation field, that still needs to be there (or something like it). I'm suggesting getting rid of special handling where if data_processing_id is not specified, that it assumes the primary one. I'd prefer to always pass the data_processing_id. I figure if you got the repertoire_id through a query, you should easily have data_processing_id along with it, so just pass both.
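The "just pass both" workflow suggested here can be sketched as a pure request builder. The helper `build_stats_request` is hypothetical; the field names follow the AIRR Repertoire schema and the Stats API draft, and `rep123`/`data123` are the illustration ids from the thread.

```python
# Sketch: a client that found repertoires via an ADC API /repertoire query
# already has data_processing_id in hand, so it forwards both ids to the
# Stats API rather than relying on any primary-annotation fallback.
import json

def build_stats_request(repertoires, fields):
    """Build a Stats API request carrying both ids for every repertoire."""
    return {
        "repertoires": [
            {
                "repertoire_id": r["repertoire_id"],
                # first (often only) DataProcessing returned by the ADC query
                "data_processing_id": r["data_processing"][0]["data_processing_id"],
            }
            for r in repertoires
        ],
        "fields": fields,
    }

# minimal stand-in for a /repertoire response
adc_response = [
    {"repertoire_id": "rep123",
     "data_processing": [{"data_processing_id": "data123"}]},
]
print(json.dumps(build_stats_request(adc_response, ["v_gene", "d_call"]), indent=2))
```

The point of this design is that the client never needs a second lookup: the ids travel together from the repertoire query to the stats request.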

bcorrie commented 4 years ago

Hmm, I see, maybe ;-D I was actually hoping to not propagate this special handling of data_processing_id. Though I'm confused on what you are suggesting, do you want to keep this special handling? That is, no data_processing_id means the API service has to find and pick the primary one and return stats for it, or are you suggesting that stats combines all data processing? I guess I was assuming the latter but maybe you mean the former ... which at least is ok scientifically.

Yes, this is what I was suggesting. If we have primary_annotation then you can just ask for a Stat, send a repertoire_id, and get a meaningful result. The API can take care of checking for more than one DataProcessing and, if so, getting the primary annotation. That seems better to me than making the user do the work.

I'm not suggesting to get rid of the primary_annotation field, that still needs to be there (or something like it). I'm suggesting getting rid of special handling where if data_processing_id is not specified, that it assumes the primary one. I'd prefer to always pass the data_processing_id. I figure if you got the repertoire_id through a query, you should easily have data_processing_id along with it, so just pass both.

It seems to me the workflow of the user of the API would at a minimum be:

  1. Find a repertoire of interest and get its repertoire_id, and possibly all of its data_processing_ids
  2. Check to see if the Repertoire has more than one data_processing_id
  3. If so, decide on which one to use, either the primary_annotation or some other one based on the user choice

Note that at least steps 1 and 2 above have to be performed every time, and if there is more than one DataProcessing then you have to do step 3. So it's true that you might always have one, but when there is more than one you have to make a choice, which takes at least a couple of extra steps.

If we take advantage of having a primary_annotation and use it by default, then the default, simple workflow on getting Stats for a Repertoire is much simpler. You do 1 and ask for Stats. You only have to take data_processing_id into account (and do 2 or 3) if you know about and care about a secondary DataProcessing.

I suppose my question would be, why have a primary_annotation if you can't take advantage of it?

schristley commented 4 years ago

I suppose my question would be, why have a primary_annotation if you can't take advantage of it?

Sure, the reason for that issue is that the ADC API does not use it like it should. While it sounds good, pragmatically it's not feasible to implement. For example, if you perform any generic query on /rearrangement where you don't specify a data_processing_id, the doc essentially implies that you should find the primary annotation and only return rearrangements for that data processing. The reference library does not implement that. That's pretty hard for a generic query where you don't even know what repertoires are being requested...

Anyways, after the recent DataRep discussions, I have a decent sense that multiple data processings per Repertoire is going to be backtracked, and there will only be one DataProcessing per Repertoire. Thus it's implicitly the primary annotation for that Repertoire; furthermore, it would mean data_processing_id isn't needed in rearrangements, as repertoire_id would be sufficient. Thus I'm fine with a blank data_processing_id meaning the primary annotation, as that will be the easiest to convert in the future, and I change my position to:

data_processing_id in the request should be not required and nullable:false

bcorrie commented 4 years ago

@schristley do you want to have a look at the current spec and see if we can close this issue off. Currently repertoire_id is required, the others are not. Both sample_processing_id and data_processing_id are currently nullable. Should they be non-nullable?

Clearly we want to allow a stats repertoire query like this:

[{"repertoire_id":"REP1"}, {"repertoire_id":"REP2"}]

Is there any reason we want to allow sample_processing_id and data_processing_id to be nullable? I would think that if the user specifies the field, they should specify a value, as there is no purpose in specifying null as far as I can tell.

So if we make both of these as not required and nullable:false, are we OK on this?

bcorrie commented 4 years ago

I suppose the one caveat would be that supplying a null field would be one mechanism to deal with #5

That is a query of:

[{"repertoire_id":"4"}]

Would give:

[{ "repertoire_id": "4", "count": 294 }]

where a query of

[{ "repertoire_id": "4", "sample_processing_id": null }]

Would give something like:

[{ "repertoire_id": "4", "sample_processing_id": "first", "count": 148 }, { "repertoire_id": "4", "sample_processing_id": "second", "count": 146 }]

Note I don't think these responses are quite correct for the stats API in its current form...

I am not sure I like this as it is overloading behaviour in an odd way...
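The overloaded-null reading floated above can be made concrete with a toy server sketch: an absent sample_processing_id aggregates over the whole repertoire, an explicit null asks for a per-SampleProcessing breakdown, and a concrete value filters to one SampleProcessing. The in-memory "repository" and the 148/146 counts are the made-up numbers from the thread; none of this is a real implementation.

```python
# Toy data: (repertoire_id, sample_processing_id) per rearrangement.
REARRANGEMENTS = [("4", "first")] * 148 + [("4", "second")] * 146

def count_stats(query):
    rep_id = query["repertoire_id"]
    if "sample_processing_id" not in query:
        # field absent: one aggregate count for the repertoire
        n = sum(1 for r, _ in REARRANGEMENTS if r == rep_id)
        return [{"repertoire_id": rep_id, "count": n}]
    if query["sample_processing_id"] is None:
        # explicit null: break counts down by sample_processing_id
        out = {}
        for r, sp in REARRANGEMENTS:
            if r == rep_id:
                out[sp] = out.get(sp, 0) + 1
        return [{"repertoire_id": rep_id, "sample_processing_id": sp, "count": n}
                for sp, n in out.items()]
    # concrete value: count just that SampleProcessing
    n = sum(1 for r, sp in REARRANGEMENTS
            if r == rep_id and sp == query["sample_processing_id"])
    return [{"repertoire_id": rep_id,
             "sample_processing_id": query["sample_processing_id"], "count": n}]

print(count_stats({"repertoire_id": "4"}))  # -> [{'repertoire_id': '4', 'count': 294}]
```

Seeing the three branches side by side also shows the objection: the same field name carries three different meanings depending on presence versus null, which is exactly the odd overloading being pushed back on.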

schristley commented 4 years ago

So if we make both of these as not required and nullable:false, are we OK on this?

Sure, that's OK to do now so we can move forward with a beta implementation.

bcorrie commented 4 years ago

Done

bcorrie commented 4 years ago

OK, closing this issue...