"All" keyword in repertoires parameter for stats API

bzimonja commented 4 years ago

For all the entry points, the repertoire_object parameter has optional "sample_processing_id" and "data_processing_id" fields, which limit the results to the sample processing and data processing records we want.

Would it be useful to have an 'all' (or 'any') value for these to indicate we would like to break down the response by, for example, sample_processing_id, without forcing the user to specify each repertoire_id+sample_processing_id combination?

schristley commented 4 years ago

One way is to just not send sample_processing_id in the request. If the parameter isn't there, then that implicitly means "all". That would naturally flow into building the query: if the field is missing from the request, the query wouldn't filter the data on that field.
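A minimal sketch of that idea, assuming the backend turns each repertoire object into a set of equality constraints. The function name build_match and the returned dictionary shape are hypothetical, not part of the Stats API specification.

```python
def build_match(repertoire):
    """Return the equality constraints implied by one repertoire object.

    Hypothetical helper: the real query builder is not described in this thread.
    """
    keys = ("repertoire_id", "sample_processing_id", "data_processing_id")
    # Absent or null fields add no constraint, so they implicitly mean "all".
    return {k: repertoire[k] for k in keys if repertoire.get(k) is not None}


print(build_match({"repertoire_id": "4"}))
print(build_match({"repertoire_id": "4", "sample_processing_id": "first"}))
```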

bzimonja commented 4 years ago

I meant in the response (in the request, not sending it would indeed imply 'all'). The repertoire object we use in the response has sample_processing_id and data_processing_id as nullable, so I thought the response wouldn't include them. E.g. if you supply repertoire_id 4, sample_processing_id 'first', you'd get counts as

[{repertoire_id: '4', sample_processing_id: 'first', count: 148}]

and if you didn't

[{repertoire_id:'4', count: 294}]

With sample_processing_id: 'all', response would be

[{repertoire_id:'4', sample_processing_id: 'first', count :148}, {repertoire_id:'4', sample_processing_id: 'second', count: 146}]

I.e. you indicate you don't filter the sample processing ids, but would like the breakdown by them.

schristley commented 4 years ago

Ah I see, yes, somewhat like how facets work. I agree that would be useful. I would suggest that we don't use a special "all" id though, as we can easily add a boolean parameter. Or we can skip the parameter completely and make it the default behavior, so that an empty sample_processing_id would give this:

[{repertoire_id:'4', count: 294}, {repertoire_id:'4', sample_processing_id: 'first', count :148}, {repertoire_id:'4', sample_processing_id: 'second', count: 146}]
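As a rough sketch of that default behavior, assuming the server already holds the per-sample_processing_id counts for a repertoire: the function name default_response and its input mapping are hypothetical, and the row layout follows the simplified examples in this thread rather than any final response schema.

```python
def default_response(repertoire_id, counts_by_sample_processing):
    """Build the total row followed by one row per sample_processing_id."""
    rows = [{
        "repertoire_id": repertoire_id,
        "count": sum(counts_by_sample_processing.values()),
    }]
    for sp_id, count in counts_by_sample_processing.items():
        rows.append({
            "repertoire_id": repertoire_id,
            "sample_processing_id": sp_id,
            "count": count,
        })
    return rows


rows = default_response("4", {"first": 148, "second": 146})
print(rows)
```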

yyweiss commented 4 years ago

Following what I mentioned in #3 , it could be possible to have a query which returns all the repertoires, and a key function (which could itself be a query) which says how to group the response.

Here too you can specify which groupings must be supported (e.g., group by sample_processing_id) and which need not be.

bcorrie commented 4 years ago

@yyweiss could you give an example of the use case you are thinking of?

One of our design goals was to keep the Stats API as simple as possible: essentially, provide simple counts for the Repertoires requested, and let the client perform any grouping and analysis that is required. The main reason is that the ways that you could combine the data are extensive, and having an API that does all of that is challenging. On the flip side, the data that is returned is not particularly large, so it is relatively easy for the client to use one of the many language-specific libraries/tools to process the data and do the grouping that you are talking about.
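For illustration, client-side grouping of this kind can be only a few lines. This is a sketch using the Python standard library; the flat row layout is the simplified one used earlier in this thread, and group_counts is a hypothetical helper name.

```python
from collections import defaultdict


def group_counts(rows, key):
    """Sum the 'count' of each row under the group returned by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row["count"]
    return dict(totals)


rows = [
    {"repertoire_id": "4", "sample_processing_id": "first", "count": 148},
    {"repertoire_id": "4", "sample_processing_id": "second", "count": 146},
]

# Collapse the per-sample-processing rows back to one total per repertoire.
by_repertoire = group_counts(rows, key=lambda r: r["repertoire_id"])
print(by_repertoire)
```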

yyweiss commented 4 years ago

Sorry for the delays. It's the holidays here, so response time is generally longer.

I'm thinking about queries such as "Return counts for all samples/repertoires involving illness x." If I understand correctly, that is what bzimonja requested in the opening post of this issue.

The problem is not the combination of data, but the splitting of data. You can always make one query to retrieve a list of repertoires, and then create the query you want. However, (1) that is an extra step which has to be taken. (2) The second query doesn't indicate what it means. (3) You can't rerun that query in the future to get updated stats on new repertoires.

bcorrie commented 4 years ago

To get such a response is a two-step process, by design. The ADC API is the query API that you would use to gather the set of Repertoires involving illness X (Query 1), and then you would use the Stats API (Query 2) to get the Stats for all of the Repertoires you found in Query 1.

Query 1 to a repository such as http://covid19-1.ireceptor.org/airr/v1/repertoire would look like this:

{
  "filters": {
    "op": "=",
    "content": {
      "field": "subject.diagnosis.disease_diagnosis",
      "value": "DOID:0080600"
    }
  },
  "fields": ["repertoire_id", "data_processing.data_processing_id", "sample.sample_processing_id"]
}

and then for each repertoire_id, data_processing_id, and sample_processing_id (e.g. RP1, DP1, SP1) you would ask for a specific stat (or set of stats):

Query 2 to http://covid19-1.ireceptor.org/irplus/v1/stats/gene_usage would look like this:

{
  "repertoires": [
    {
      "repertoire_id": "RP1",
      "sample_processing_id": "SP1",
      "data_processing_id": "DP1"
    }
  ],
  "statistics": [
    "v_subgroup",
    "d_subgroup",
    "j_subgroup"
  ]
}

Given that the ADC API already exists, we don't want to reimplement its search capability in the Stats API. We want to keep the Stats API as simple as possible but be able to use the ADC API and the Stats API together to accomplish the task you outline.
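The glue between the two queries could be sketched as below. It makes no network calls and rests on assumptions: that the Query 1 response lists repertoires under a "Repertoire" key with nested "sample" and "data_processing" arrays, and that every sample_processing_id should be paired with every data_processing_id of the same repertoire. The helper name stats_request is hypothetical.

```python
import json
from itertools import product


def stats_request(adc_response, statistics):
    """Build a Stats API request body from a Query 1 (ADC API) response."""
    repertoires = []
    for rep in adc_response.get("Repertoire", []):
        # Fall back to a single null id when an array is missing or empty,
        # since both ids are nullable in the repertoire object.
        sp_ids = [s.get("sample_processing_id") for s in rep.get("sample", [])] or [None]
        dp_ids = [d.get("data_processing_id") for d in rep.get("data_processing", [])] or [None]
        # Assumption: pair every sample processing with every data processing.
        for sp_id, dp_id in product(sp_ids, dp_ids):
            repertoires.append({
                "repertoire_id": rep["repertoire_id"],
                "sample_processing_id": sp_id,
                "data_processing_id": dp_id,
            })
    return {"repertoires": repertoires, "statistics": list(statistics)}


query1_response = {
    "Repertoire": [{
        "repertoire_id": "RP1",
        "sample": [{"sample_processing_id": "SP1"}],
        "data_processing": [{"data_processing_id": "DP1"}],
    }]
}
body = stats_request(query1_response, ["v_subgroup", "d_subgroup", "j_subgroup"])
print(json.dumps(body, indent=2))
```

Because Query 1 is kept as a saved filter, rerunning both steps later would pick up any newly added repertoires.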

@bzimonja was asking a simpler question in that in the above Stats API call, when you ask for just RP1, is there an easy way to get all of the counts for all of the DPs and SPs without having to list them all explicitly. At least I think that is correct...

yyweiss commented 4 years ago

What is bothering me with this design is that it is locking the stats API to be able to only provide stats for very specific rearrangement sets. You can always query the rearrangements directly and do the counting yourself, but that defeats the purpose of having the Stats API in the first place.

As it is, you still need to be able to specify if you want the results grouped by sample_processing_id, data_processing_id, both, or neither, so you need more than a binary flag.

bcorrie commented 4 years ago

@bzimonja @schristley given that the count response is now the same structure as the other statistics, a rearrangement_count statistic response for just repertoire_id == 4 would look like this:

  "Result": [
    { "repertoire": { "repertoire_id": "4"}, "statistics": [ {"statistic_name": "rearrangement_count", "total": 294, "data": [ {"key": "rearrangement_count", "count": 294} ]  } ] }
  ]
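A client could pull the counts out of a response of this shape roughly as follows. This is a sketch that assumes the structure shown above ("Result", "repertoire", "statistics", "statistic_name", "total"); the helper name rearrangement_counts is hypothetical.

```python
def rearrangement_counts(response):
    """Map each repertoire_id to its rearrangement_count total."""
    counts = {}
    for entry in response["Result"]:
        rep_id = entry["repertoire"]["repertoire_id"]
        for stat in entry["statistics"]:
            if stat["statistic_name"] == "rearrangement_count":
                counts[rep_id] = stat["total"]
    return counts


response = {
    "Result": [
        {
            "repertoire": {"repertoire_id": "4"},
            "statistics": [
                {
                    "statistic_name": "rearrangement_count",
                    "total": 294,
                    "data": [{"key": "rearrangement_count", "count": 294}],
                }
            ],
        }
    ]
}
print(rearrangement_counts(response))
```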

As @yyweiss says, there are 4 possible ways to present the results if you want some sort of "all" capability:

Just count for repertoire_id
repertoire_id and all data_processing_id
repertoire_id and all sample_processing_id
repertoire_id and all sample_processing_id and all data_processing_id
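To make the point about a single binary flag concrete, here is a small sketch that enumerates the groupings from two independent on/off choices, one per optional id. The function name breakdown_options is hypothetical and nothing here is part of the Stats API.

```python
from itertools import product


def breakdown_options():
    """List the grouping keys for every on/off choice of the optional ids."""
    optional = ("sample_processing_id", "data_processing_id")
    options = []
    for flags in product((False, True), repeat=len(optional)):
        # repertoire_id is always a grouping key; each flag adds one id.
        chosen = [name for name, on in zip(optional, flags) if on]
        options.append(["repertoire_id"] + chosen)
    return options


for keys in breakdown_options():
    print(keys)
```

Two independent flags give four combinations, which is why one boolean parameter would not cover all of the cases.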

I would suggest that maybe we should keep it simple, and just give the user what they ask for. If the user wants any of the "all" situations, they can get the list of all _ids from a query. Not the most convenient, but...

If not, what is the mechanism we use to allow the user to say they want counts for all of the above?

bcorrie commented 4 years ago

> What is bothering me with this design is that it is locking the stats API to be able to only provide stats for very specific rearrangement sets. You can always query the rearrangements directly and do the counting yourself, but that defeats the purpose of having the Stats API in the first place.

@yyweiss the Stats API can by design only do Stats for very specific rearrangement sets. The rearrangements in the AIRR Spec have three IDs by which they are linked back to a Repertoire and its components (SampleProcessing and DataProcessing). It is exactly those rearrangement sets that the Stats API is designed to get Stats for. This limitation allows us to do these Stats very quickly.

If you want to do more Stats with more general queries then they are likely not quick and therefore the Analysis API should be used.

schristley commented 4 years ago

> @bzimonja @schristley given that the count response is now the same structure as the other statistics, a rearrangement_count statistic response for just repertoire_id == 4 would look like this:

Right, there will only ever be one key value for each count statistic, but that's fine.

> I would suggest that maybe we should keep it simple, and just give the user what they ask for. If the user wants any of the all situations, they can get the list of all _ids from a query. Not the most convenient, but...

I'm fine with keeping it simple.

bcorrie commented 4 years ago

@bzimonja and @schristley I think we can close this issue... If you give me the thumbs up I will close it.

bcorrie commented 4 years ago

Closing...

ireceptor-plus / specifications

"All" keyword in repertoires parameter for stats API #5