snaketron / ClustIRR

Clustering of immune receptor repertoires
GNU General Public License v3.0

What should be the main input (data_sample) of gliphR? #5

Closed snaketron closed 1 year ago

snaketron commented 1 year ago

The original gliph algorithm uses as input the following:

To use V+J information in a way that affects the local/global clustering, the user also has to set additional input parameters. My impression is that very few users do this, i.e. most users will provide only CDR3b sequences as input.

Hence my suggestion for gliphR:

We use as main input (parameter data_sample) a data.frame with 1 or 2 columns:
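As a rough illustration of the minimal one-column case only: the column name `CDR3b` and the sequences below are placeholders I made up, not a settled interface, and the optional second column is left out here.

```r
# Hypothetical example input; the column name `CDR3b` is an assumption,
# and the sequences are invented for illustration.
data_sample <- data.frame(
  CDR3b = c("CASSLAPGATNEKLFF", "CASSLDRGEVFF", "CASSYLAGGRNTLYF"),
  stringsAsFactors = FALSE
)

# Basic sanity checks a clustering function might run on its input:
# a data.frame, a character CDR3b column, and no missing sequences.
input_ok <- is.data.frame(data_sample) &&
  is.character(data_sample$CDR3b) &&
  !anyNA(data_sample$CDR3b)
```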

What do you think?

kaozkai commented 1 year ago

Hey Simo,

As far as I understand so far, two options are affected by the V-gene information: one used only by Gliph1/turboGliph and one used by both Gliph1 and Gliph2.

Gliph1 offers the "vgene_stratify" option, which samples from the example database stratified by the V-gene frequency distribution of the input data.
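To make the stratified sampling concrete, here is a minimal sketch of how I read it, not the actual Gliph1 code. The function name `sample_stratified`, the `ref_db` data.frame, its `TRBV` column, and sampling with replacement are all assumptions on my side.

```r
# Sketch only: `sample_stratified`, `ref_db` and the `TRBV` column are
# hypothetical names; sampling with replacement is an assumption.
sample_stratified <- function(ref_db, input_v, n) {
  # V-gene counts in the input data
  v_counts <- table(input_v)
  # Number of draws per V gene, proportional to its input frequency
  n_per_v <- setNames(
    as.integer(round(n * as.numeric(v_counts) / sum(v_counts))),
    names(v_counts)
  )
  idx <- unlist(lapply(names(n_per_v), function(v) {
    pool <- which(ref_db$TRBV == v)
    # V genes missing from the reference contribute no rows
    if (length(pool) == 0) return(integer(0))
    # sample.int avoids the sample() pitfall when `pool` has length 1
    pool[sample.int(length(pool), size = n_per_v[[v]], replace = TRUE)]
  }))
  ref_db[idx, , drop = FALSE]
}
```

With this, the V-gene composition of the sampled reference set mirrors the input, up to rounding.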

Both Gliph1 and Gliph2 use the "global_vgene" option to restrict global relationships to TCRs of common V-genes.
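A small sketch of what this restriction amounts to. I am assuming here that a "global relationship" means two CDR3s of equal length that differ at exactly one position; the helper names `is_global_pair` and `global_pairs` and their arguments are made up for illustration and are not taken from either implementation.

```r
# Assumption: two CDR3s are globally related if they have the same length
# and differ at exactly one position.
is_global_pair <- function(a, b) {
  if (nchar(a) != nchar(b)) return(FALSE)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]) == 1
}

# Hypothetical helper: list all globally related pairs (by index).
# With global_vgene = TRUE, a pair is kept only if both share a V gene.
global_pairs <- function(cdr3, v_gene = NULL, global_vgene = FALSE) {
  n <- length(cdr3)
  if (n < 2) return(data.frame(i = integer(0), j = integer(0)))
  pairs <- t(combn(n, 2))  # one row per unordered pair of indices
  keep <- apply(pairs, 1, function(p) {
    ok <- is_global_pair(cdr3[p[1]], cdr3[p[2]])
    if (ok && global_vgene) ok <- v_gene[p[1]] == v_gene[p[2]]
    ok
  })
  data.frame(i = pairs[keep, 1], j = pairs[keep, 2])
}
```

The all-pairs loop is quadratic and only meant to show the logic, not to scale to real repertoires.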

Furthermore, Gliph1 and Gliph2 seem to use V-gene information for the scoring part as "v_usage_freq".

I don't know yet how much this "v_usage_freq" affects the scoring; maybe we should keep it as an optional input until we are sure about its importance?

snaketron commented 1 year ago

@kaozkai

> Gliph1 offers the "vgene_stratify" option, to sample from the example database stratified by V-gene frequency distribution of the input data.
>
> Both Gliph1 and Gliph2 use the "global_vgene" option to restrict global relationships to TCRs of common V-genes.

Exactly! Both of these functionalities are disabled by default. Hence we can ignore them for the moment in terms of the main gliph algorithm (the clustering part).

> Furthermore, Gliph1 and Gliph2 seem to use V-gene information for the scoring part as "v_usage_freq".
>
> I don't know yet how much this "v_usage_freq" affects the scoring; maybe we should keep it as optional input until we are sure about its importance?

This is the key use of V or J genes - to find out if clusters are enriched with a specific type of V or J gene.

To avoid this problem I have split the workflow into two functions: 1) clustering and 2) scoring. For gliph (the clustering part) we only need CDR3s. This keeps the gliph function simple and easy to use. The scoring function (incomplete) should be very versatile: it should accept as input any vector of attributes (numeric or categorical), with one value for each cell (TCR) contained as a row in the original data_sample, and compute enrichment scores based on that attribute.
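To sketch what the scoring side could look like for a categorical attribute such as the V gene: the function name `score_enrichment`, its arguments, and the choice of a one-sided Fisher's exact test are assumptions for illustration only, since the scoring function is not finished. Numeric attributes would need a different test and are not covered.

```r
# Hypothetical scoring helper; name, arguments, and the use of Fisher's
# exact test are illustrative assumptions, not the final implementation.
# in_cluster: logical vector, TRUE if the TCR (row) belongs to the cluster
# attribute:  categorical vector with one value per TCR (e.g. V gene)
# level:      the attribute value to test for enrichment
score_enrichment <- function(in_cluster, attribute, level) {
  stopifnot(length(in_cluster) == length(attribute))
  has_level <- attribute == level
  # 2x2 contingency table: cluster membership vs. attribute level
  tab <- table(
    factor(in_cluster, levels = c(TRUE, FALSE)),
    factor(has_level, levels = c(TRUE, FALSE))
  )
  # One-sided test: is `level` over-represented inside the cluster?
  fisher.test(tab, alternative = "greater")$p.value
}

# Toy data: 4 clustered TCRs all carry the same V gene, the other 6 do not.
in_cluster <- c(rep(TRUE, 4), rep(FALSE, 6))
v_gene <- c(rep("TRBV5-1", 4), rep("TRBV7-9", 6))
p_enriched <- score_enrichment(in_cluster, v_gene, "TRBV5-1")
```

Because the attribute is passed as a separate vector aligned to the rows of data_sample, the clustering input itself can stay CDR3-only.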