What should be the main input (data_sample) of gliphR?

snaketron commented 1 year ago

The original gliph algorithm uses as input the following:

minimum: vector of CDR3b sequences
maximum: data.frame with CDR3b + V + J (+ 3 columns for alpha chain)

For the V+J information to affect local/global clustering, the user also has to set additional input parameters. My impression is that very few users do this, i.e. most users will provide only CDR3b sequences as input.

Hence my suggestion for gliphR:

We use as main input (parameter data_sample) a data.frame with 1 or 2 columns:

if 1 column -> the column has to be named CDR3a or CDR3b
if 2 columns -> the columns will represent CDR3a and CDR3b (order not relevant)
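To make the rule concrete, here is a minimal sketch of the check, written in Python purely for illustration (gliphR itself is an R package). The function name `validate_data_sample` and the use of a plain dict in place of a data.frame are assumptions of this sketch, not part of any existing API.

```python
# Hypothetical helper: a dict of column name -> list of sequences stands in
# for the proposed data.frame.
ALLOWED_COLUMNS = {"CDR3a", "CDR3b"}


def validate_data_sample(data_sample):
    """Check the 1-or-2-column rule and return the chains in a fixed order."""
    columns = list(data_sample)
    if len(columns) not in (1, 2):
        raise ValueError("data_sample must have 1 or 2 columns")
    unknown = set(columns) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unexpected column(s): {sorted(unknown)}")
    # Every column must describe the same set of rows (cells/TCRs).
    if len({len(data_sample[c]) for c in columns}) > 1:
        raise ValueError("all columns must have the same number of rows")
    # Column order is not relevant, so normalise it here.
    return sorted(columns)
```

With this check, a single CDR3b column and a CDR3b + CDR3a pair in either order are accepted, while any other column name is rejected.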

What do you think?

kaozkai commented 1 year ago

Hey Simo,

as far as I understand so far, two options are affected by the V-gene information: one used only by Gliph1/turboGliph and one used by both Gliph1 and Gliph2.

Gliph1 offers the "vgene_stratify" option, to sample from the example database stratified by V-gene frequency distribution of the input data.

Both Gliph1 and Gliph2 use the "global_vgene" option to restrict global relationships to TCRs of common V-genes.
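A rough sketch of what such a restriction could look like, in Python for illustration only. It assumes, as a simplification, that a global relationship means two CDR3s of equal length differing at exactly one position; the names `hamming_one` and `global_pairs` are made up for this sketch and do not come from Gliph1, Gliph2 or turboGliph.

```python
from itertools import combinations


def hamming_one(a, b):
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def global_pairs(tcrs, global_vgene=False):
    """tcrs: list of (cdr3, v_gene) tuples. Returns index pairs that are globally related."""
    pairs = []
    for (i, (cdr3_i, v_i)), (j, (cdr3_j, v_j)) in combinations(enumerate(tcrs), 2):
        if not hamming_one(cdr3_i, cdr3_j):
            continue
        # Assumed behaviour of the option: keep the pair only if V genes match.
        if global_vgene and v_i != v_j:
            continue
        pairs.append((i, j))
    return pairs
```

With the option off, the V-gene column is never inspected, which is why CDR3-only input would be enough in the default setting.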

Furthermore, Gliph1 and Gliph2 seem to use V-gene information for the scoring part as "v_usage_freq".

I don't know yet how much this "v_usage_freq" affects the scoring; maybe we should keep it as optional input until we are sure about its importance?

snaketron commented 1 year ago

@kaozkai

Gliph1 offers the "vgene_stratify" option, to sample from the example database stratified by V-gene frequency distribution of the input data. Both Gliph1 and Gliph2 use the "global_vgene" option to restrict global relationships to TCRs of common V-genes.

Exactly! Both of these functionalities are disabled by default. Hence we can ignore them for the moment in terms of the main gliph algorithm (the clustering part).

Furthermore, Gliph1 and Gliph2 seem to use V-gene information for the scoring part as "v_usage_freq". I don't know yet how much this "v_usage_freq" affects the scoring; maybe we should keep it as optional input until we are sure about its importance?

This is the key use of V or J genes - to find out if clusters are enriched with a specific type of V or J gene.

To avoid this problem I have split the workflow into two functions: 1) clustering and 2) scoring. For gliph (the clustering part) we only need CDR3s. This makes the gliph function simple and easy to use. The scoring function (incomplete) should be very versatile. It should accept as input any vector of attributes (numeric or categorical), with one value for each cell (TCR) contained as a row in the original data_sample, and compute enrichment scores based on that attribute.
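As a sketch of the idea for the categorical case only, in Python for illustration: the function names, the dict-based cluster representation and the choice of a hypergeometric upper-tail p-value as the enrichment score are all assumptions here, since the scoring function is not yet complete and no statistic has been fixed.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n rows without replacement from N rows, K of which carry the value."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1)) / total


def score_categorical(clusters, attribute):
    """clusters: cluster id -> row indices of data_sample; attribute: one value per row."""
    N = len(attribute)
    scores = {}
    for cluster_id, rows in clusters.items():
        n = len(rows)
        for value in {attribute[r] for r in rows}:
            k = sum(attribute[r] == value for r in rows)  # rows with the value in the cluster
            K = sum(a == value for a in attribute)        # rows with the value overall
            scores[(cluster_id, value)] = hypergeom_upper_tail(k, N, K, n)
    return scores
```

Passing the V gene of each row as `attribute` would give a V-gene enrichment score per cluster, and the same call works for J genes or any other categorical label, without the clustering step ever needing that column.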
