List of plots [REPLACEMENT ISSUE]

grst commented 4 years ago

The original issue

Id: 9
Title: List of plots

could not be created. This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the gitlab repository is still existing, visit the following link to show the original issue:

TODO

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 22, 2020, 16:29

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 22, 2020, 16:30

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 22, 2020, 16:33

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 22, 2020, 16:40

changed the description

grst commented 4 years ago

In GitLab by @grst on Jan 23, 2020, 12:02

Clonal Expansion visualized in the Zheng et al (2017) paper.

grst commented 4 years ago

In GitLab by @grst on Jan 24, 2020, 11:30

@szabogtamas

number of inserted nucleotides -> box or bar plots by cell group; or maybe color-based on the umap

number of nucleotides inserted in the vj junction -> box or bar plots by cell group; or maybe color-based on the umap

Where do we find that data in 10x and Tracer files?

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 24, 2020, 12:03

This is information that has to be calculated by the preprocessing script. In the contigs.json file we have the start and end positions for the V, D and J blocks, and if the start of D is not the position right after the end of V, then there are inserted nucleotides.
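A minimal sketch of that calculation, for illustration only. The `Regions` mapping is a hypothetical, simplified view of one contig (region name to start and end position, with an exclusive end assumed); it does not reproduce the actual contigs.json layout, and `count_insertions` is not the name of any existing function.

```python
from typing import Dict, Optional, Tuple

# Hypothetical, simplified view of one contig: region name -> (start, end).
# `end` is assumed to be exclusive; adjust the arithmetic if it is inclusive.
Regions = Dict[str, Tuple[int, int]]


def count_insertions(regions: Regions) -> Dict[str, Optional[int]]:
    """Count the nucleotides lying between adjacent V, D and J blocks."""

    def gap(left: str, right: str) -> Optional[int]:
        if left not in regions or right not in regions:
            return None
        # Overlapping blocks are reported as zero insertions.
        return max(0, regions[right][0] - regions[left][1])

    if "D" in regions:
        # Chains with a D segment: report the VD and DJ junctions separately.
        return {"VJ": None, "VD": gap("V", "D"), "DJ": gap("D", "J")}
    # Chains without a D segment: a single VJ junction.
    return {"VJ": gap("V", "J"), "VD": None, "DJ": None}


alpha = {"V": (0, 280), "J": (283, 340)}
beta = {"V": (0, 290), "D": (292, 304), "J": (304, 350)}
print(count_insertions(alpha))
print(count_insertions(beta))
```

Returning `None` for junctions that do not apply keeps the result shape identical for both chain types, which would make it easier to store as fixed columns.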

grst commented 4 years ago

In GitLab by @grst on Jan 24, 2020, 12:15

ok, so no way getting this from the .csv file. Could you check where we can find this information in the Tracer data?

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 24, 2020, 12:49

With 10x it is not a problem. This is included in the current preprocessing script and we can make it a separate function easily.

With Tracer it is a good point. I cannot find a hint in the summary files. What we can do is to parse the filtered_TCR_seqs/filtered_TCRs.txt files for the trinity id of the chosen sequences (the ones that will be considered as TCR of the cell) and then extract the IgBlast result for that trinity id from that file we were looking at the last time. This would also solve the CDR1&2 issue. If we are lucky, we can convert the IgBlast output to json first and then it is doable.

grst commented 4 years ago

In GitLab by @grst on Jan 24, 2020, 14:11

I agree that parsing the summary data from tracer is not enough (see also #10).

Some more points regarding the plots:

length of CDR3 regions -> box or bar plots by cell group; or maybe color-based on the umap

number of inserted nucleotides -> box or bar plots by cell group; or maybe color-based on the umap

number of nucleotides inserted in the vj junction -> box or bar plots by cell group; or maybe color-based on the umap

Here, again, we will need to define for which chains we want to plot that. Aggregate it by cell? Make (up to) four different plots?

number (or ratio among total cell number) of cells in a clonotype -> Sankey-like plots or bar plots by cell group; or maybe a treemap by cell group or a heat plot for two different groupings (samples vs cell types)

I don't get this one, do you have an example figure?

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 27, 2020, 16:29

Wishlist restructured

I. Cell-based information:
Calculated once by the preprocessing script when importing 10x or Tracer results

[x] length of CDR3 regions (continuous, four columns)
[x] number of inserted nucleotides (continuous, four columns). VJ for alpha, VD + DJ for beta.
[x] CDR3 sequence, AA and NT (string, four columns each)
[x] presence of secondary chain (categorical, single column)
[x] clonotype
[x] VDJ genes (categorical, 12 columns)

I.2 Cell-based information, generated

[x] clonotypes (categorical, single column)
[ ] convergence of chains [nucleotide versions of a single CDR3 aa sequence] (continuous, four columns)
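As an illustration of the convergence item, the following sketch counts how many distinct nucleotide sequences encode each CDR3 amino-acid sequence. The function name and the input layout (one `(aa, nt)` pair per cell and chain) are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def cdr3_convergence(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each CDR3 aa sequence to its number of distinct nucleotide versions."""
    versions: Dict[str, Set[str]] = defaultdict(set)
    for aa, nt in pairs:
        versions[aa].add(nt)
    return {aa: len(nts) for aa, nts in versions.items()}


# Toy input: three cells share the aa sequence "CAS" via two nucleotide versions.
cells = [
    ("CAS", "TGTGCTAGC"),
    ("CAS", "TGCGCCAGC"),
    ("CAS", "TGTGCTAGC"),
    ("CAV", "TGTGCTGTG"),
]
print(cdr3_convergence(cells))
```

The per-cell value would then be a lookup of the cell's own aa sequence in this mapping.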

II. Cell-cell relation:
These are basically cell-based features, but the table would just explode if we wanted to include them in the cell table. It might be better to keep them separately in a sparse matrix or an upper triangle. Or create umaps and store the x and y of the umaps in the cell table?

[ ] Shared clonotypes
[ ] Shared chains (four columns)
[ ] Shared CDR3aa but different CDR3nuc
[ ] Similar CDR3aa sequences (tcrdist)
[ ] Similar physicochemical features (Kidera factors)
[ ] ?Shared kmers?
[ ] ?GLIPH networks?
[ ] ?Chains recognizing the same epitopes based on McPAS-TCR?
[ ] ?epitope reactivity? (list, single column) -> query external database
number of inserted nucleotides (continuous, four columns)
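To make the "sparse upper triangle" idea concrete, here is a rough sketch for the shared-clonotype relation. It stores only cell pairs `(i, j)` with `i < j` that share a clonotype, using a plain dict as a stand-in for a real sparse matrix; the function name is hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def shared_clonotype_pairs(clonotypes: List[str]) -> Dict[Tuple[int, int], int]:
    """Return a sparse upper triangle: (i, j) -> 1 for cells i < j sharing a clonotype."""
    by_clonotype: Dict[str, List[int]] = {}
    for idx, clonotype in enumerate(clonotypes):
        by_clonotype.setdefault(clonotype, []).append(idx)

    pairs: Dict[Tuple[int, int], int] = {}
    for members in by_clonotype.values():
        # `members` is in ascending order, so every pair satisfies i < j.
        for i, j in combinations(members, 2):
            pairs[(i, j)] = 1
    return pairs


clonotypes = ["ct1", "ct2", "ct1", "ct3", "ct1"]
print(shared_clonotype_pairs(clonotypes))
```

Only pairs that actually share a clonotype are stored, so memory grows with the number of related pairs rather than with the square of the cell count.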

III. Group-based features
Has to be calculated on the fly or when creating groups <- from cell-based information

number of cells in the group (absolute number and ratios)
size of clonotypes in the group (absolute number and ratios) <- from clonotype membership and number of cells in the group
- clonotype multiplicity <- from clonotype membership and number of cells in the group
- size of singleton clonotypes in the group (absolute number and ratios) <- from clonotype multiplicity and number of cells in the group
- size of doublet clonotypes in the group (absolute number and ratios) <- from clonotype multiplicity and number of cells in the group
- size of triplet clonotypes in the group (absolute number and ratios) <- from clonotype multiplicity and number of cells in the group
- size of quadruple clonotypes in the group (absolute number and ratios) <- from clonotype multiplicity and number of cells in the group
- quartile distribution of clonotypes in the group <- from size of clonotypes
diversity of the group <- from clonotype membership and number of cells in the group
spectratype (cell number by CDR3 length) in the group <- from length of CDR3 regions
VDJ (or VJ) usage in the group <- from VDJ genes
sequence logo of group <- from chain identities
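A sketch of how the multiplicity-related features above could be derived on the fly from cell-based information. The input layout (one `(group, clonotype)` pair per cell) and all names are assumptions for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def clonotype_multiplicity(cells: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """Summarise clonotype sizes per group from (group, clonotype) pairs."""
    sizes: Dict[str, Counter] = defaultdict(Counter)  # group -> clonotype -> cells
    for group, clonotype in cells:
        sizes[group][clonotype] += 1

    summary = {}
    for group, counts in sizes.items():
        n_cells = sum(counts.values())
        # Clonotype size -> number of clonotypes of that size (1 = singleton, ...).
        multiplicity = Counter(counts.values())
        summary[group] = {
            "n_cells": n_cells,
            "multiplicity": dict(multiplicity),
            # Each singleton clonotype contributes exactly one cell.
            "singleton_cell_ratio": multiplicity.get(1, 0) / n_cells,
        }
    return summary


cells = [("s1", "a"), ("s1", "a"), ("s1", "b"), ("s2", "c")]
print(clonotype_multiplicity(cells))
```

Doublet, triplet and larger classes can be read from the same `multiplicity` mapping, so nothing beyond clonotype membership needs to be stored per cell.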

IV. Group-based features
Probably this one is the most problematic, but the question will often arise in this context: which samples, patients or cell types share a feature...

repertoire overlap among groups
?similar spectratypes?
?similar usage of VDJ genes?
+all the features calculated as cell-cell relations

grst commented 4 years ago

In GitLab by @grst on Jan 30, 2020, 11:57

List of plotting functions to implement moved to the very top

grst commented 4 years ago

In GitLab by @grst on Jan 30, 2020, 12:59

changed title from {-Write a list of plots we want to have-} to {+List of plots+}

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 30, 2020, 15:40

We have sequence_logo both as a tool and as a plotting function. I would only go for the plotting function. Maybe I would even put the alpha_diversity into the plotting part only. It will always have to be recalculated by groups and we might not even want to store it. Furthermore, the best place to store them would be the uns that I would like to keep as empty as possible.

If so, convergence calculation might also be a plotting function only.

grst commented 4 years ago

In GitLab by @grst on Jan 30, 2020, 15:44

Hmm, I'll think about it.

In scanpy it is common practice to have everything as both a tool and a plotting function. In scanpy, a plotting function never computes anything, it just displays stuff that is already in anndata.

This makes sense especially when plotting something that takes a long time to compute (e.g. UMAP, and, in our case, sequence logos).

I agree that it feels a bit cumbersome, though, to always call tl.alpha_diversity and pl.alpha_diversity for each group.

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 30, 2020, 16:15

Another, conceptual question is the naming of the functions. For most plots, the same dataset (a list or dictionary of values) would be passed on to draw a violin, box or barplot. Should we name the plotting functions according to what they draw (violin or bar) or based on the question they answer? In the latter case, the visualization type would only be an attribute.

For example:

st.pl.cdr3_length(adata, groupby=None, subgroupby=None, vistype='violin')
- the groupby argument can be the name of a grouping column or None to show this for the whole population
- subgroupby would give us the paired or stacked columns if specified
- vistype specifies the actual look of the plot; would be violin and umap for now, but later we could add box and bar, as well as histogram and that is actually equal to a spectratype
st.pl.group_abundance(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='bar')
st.pl.group_overlap(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='chord')
st.pl.diversity(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='bar')
st.pl.convergence(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='bar')
st.pl.vdj_usage(adata, groupby='sample', relative=None, vistype='chord')
st.pl.sequence_logos(adata, groupby='sample', vistype='logo')
st.pl.group_similarities(adata, groupby='sample', distancematrix='tcrdist', vistype='chord|umap|dendrogram')

Of course, we could have a separate set of plotting functions called chord and so on that would be called by the upper convenience functions.

grst commented 4 years ago

In GitLab by @grst on Jan 30, 2020, 16:20

I think we should differentiate between the basic and specific plotting functions.

The basic ones should be named by what they show (e.g. violin).

For the specific ones, I'm in favor of having a single function for which the vistype is an option. That way we can also start with a single visualization and add others later on.

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 08:44

In scanpy they just have a function for each vistype, e.g.

sc.pl.rank_genes_groups_dotplot
sc.pl.rank_genes_groups_matrixplot
sc.pl.rank_genes_groups_heatmap

Pro: It generates visibility for each visualization type.
Con: It's just so many functions.

I think I'm still in favor of having the vistype argument.

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 31, 2020, 09:57

Or we can make a "mother function" that has the vistype option, which is easier to extend for now, and create "fake" plotting functions that just call the "mother function" with one specific vistype argument, just to conform to scanpy conventions better...
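A rough sketch of that "mother function" pattern. The renderers are stand-ins that return a description instead of drawing anything, and every name here is hypothetical rather than an existing function.

```python
from functools import partial
from typing import Callable, Dict, List


# Stand-in renderers: a real implementation would draw a figure here.
def _violin(values: List[int]) -> str:
    return f"violin plot of {len(values)} values"


def _bar(values: List[int]) -> str:
    return f"bar plot of {len(values)} values"


_RENDERERS: Dict[str, Callable[[List[int]], str]] = {"violin": _violin, "bar": _bar}


def cdr3_length(values: List[int], vistype: str = "violin") -> str:
    """'Mother function': named after the question, vistype picks the renderer."""
    try:
        render = _RENDERERS[vistype]
    except KeyError:
        raise ValueError(f"unknown vistype {vistype!r}; choose from {sorted(_RENDERERS)}") from None
    return render(values)


# Thin wrappers that only fix the vistype argument.
cdr3_length_violin = partial(cdr3_length, vistype="violin")
cdr3_length_bar = partial(cdr3_length, vistype="bar")

print(cdr3_length([12, 14, 13], vistype="bar"))
print(cdr3_length_violin([12, 14, 13]))
```

Adding a new visualization then means registering one renderer and, if wanted, one more wrapper.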

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 10:24

Regarding the duplication of functions in plotting and tools:

Pro:

makes sense for computationally expensive plots (e.g. sequence logos) to not re-compute everything, just because you want to change the axis label
It might sometimes be relevant to get the raw values (e.g. get the actual entropy values from alpha_diversity instead of just plotting them).
Seems to be a common pattern in scanpy

Con:

More complicated for the user to call two functions to get a plot.
Storing stuff in AnnData that might never be needed again

A compromise could be to offer both pl and tl functions, but automatically run the tool with default parameters from the plotting function, if it has not been precomputed.
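A minimal sketch of this compromise, assuming Shannon entropy as the diversity score. `FakeAnnData` is a hypothetical stand-in that only carries `obs` rows and an `uns` dict, and the plotting function returns the values it would draw instead of producing a figure.

```python
import math
from collections import Counter, defaultdict


class FakeAnnData:
    """Hypothetical stand-in for AnnData: only `obs` rows and an `uns` dict."""

    def __init__(self, obs):
        self.obs = obs  # list of dicts, one per cell
        self.uns = {}


def tl_alpha_diversity(adata, groupby="sample", target="clonotype"):
    """Tool: compute Shannon entropy per group and cache it in `uns`."""
    counts = defaultdict(Counter)
    for row in adata.obs:
        counts[row[groupby]][row[target]] += 1
    result = {}
    for group, c in counts.items():
        n = sum(c.values())
        result[group] = sum(-(k / n) * math.log2(k / n) for k in c.values())
    # The cache key includes the grouping so different groupings do not collide.
    adata.uns[f"alpha_diversity_{groupby}"] = result
    return result


def pl_alpha_diversity(adata, groupby="sample"):
    """Plot: run the tool with defaults only if nothing was precomputed."""
    key = f"alpha_diversity_{groupby}"
    if key not in adata.uns:
        tl_alpha_diversity(adata, groupby=groupby)
    # Stand-in for drawing: return the (group, score) pairs that would be plotted.
    return sorted(adata.uns[key].items())


adata = FakeAnnData(
    [
        {"sample": "s1", "clonotype": "a"},
        {"sample": "s1", "clonotype": "a"},
        {"sample": "s1", "clonotype": "b"},
        {"sample": "s1", "clonotype": "b"},
        {"sample": "s2", "clonotype": "c"},
    ]
)
print(pl_alpha_diversity(adata))
```

Users who want the raw values can still call the tool directly, while a single plotting call is enough for the common case.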

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 31, 2020, 10:38

Yes, I think checking if the tool function was run and calling it from the plotting function if not is an excellent idea! We should do this!

Regarding the raw values: at some point there might be a need for creating tables. Especially, if it is just diversity scores for a couple of samples or the abundance of the top 10 clonotypes in the samples. In my mind this would also be a plotting function.

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 10:41

> Regarding the raw values: at some point there might be a need for creating tables. Especially, if it is just diversity scores for a couple of samples or the abundance of the top 10 clonotypes in the samples. In my mind this would also be a plotting function.

Let's keep that in mind. Maybe vistype='table' could be an option.

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 10:46

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 31, 2020, 10:48

Yes, exactly: vistype='table' and leave the implementation for a later time point.

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 10:53

Can you please integrate this list into the overview at the top? I think there are some duplicates...

grst commented 4 years ago

In GitLab by @grst on Jan 31, 2020, 10:57

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 31, 2020, 11:38

changed the description

grst commented 4 years ago

In GitLab by @szabogtamas on Jan 31, 2020, 11:40

I edited the overview.

grst commented 4 years ago

In GitLab by @grst on Feb 2, 2020, 19:48

marked the task st.pl.clonal_expansion(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='bar') Show fraction of n=1, n=2 and n>=3 clonotypes for each group in the groupby (optionally combined with subgroupby) grouping in obs. If relative is not None, it should point to a grouping, ideally one already supplied as groupby or subgroupby. as completed

grst commented 4 years ago

In GitLab by @grst on Feb 2, 2020, 19:48

marked the task st.pl.alpha_diverities(adata, forgroup='clonotype', groupby='sample', subgroupby=None, vistype='bar') Simply plot the diversity scores calculated by the st.tl.alpha_diverities function. as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:36

marked the task st.pl.group_abundance(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='bar') The number or ratio of a group present in another group, for example, the presence of the top 10 clonotypes in each sample. as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:36

marked the task st.pl.cdr_convergence(adata, forchain='alpha', groupby='sample', subgroupby=None, vistype='bar') For each cell, we check how many nucleotide versions of the CDR3 region of forchain exist in groupby. as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:37

marked the task st.pl.chain_pairing(adata, forchain='alpha', groupby='sample', subgroupby=None, vistype='bar') Plots the ratio of single pair, double pair and orphan alpha or beta chain cells. We just call a basic plotting function, but include it as a separate function for the sake of completeness. as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:37

marked the task st.pl.spectratype(adata, groupby='sample', subgroupby='Vgene', relative=None, vistype='chord') The distribution (pdf) of CDR3 lengths in cell groups (cell types, samples, or cells with a specific V gene); stacked barplot or just histogram curves as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:38

marked the task st.pl.repertoire_overlap(adata, forgroup='clonotype', groupby='sample', subgroupby=None, relative=None, vistype='chord') The number or fraction of cells that belong to the same forgroup but different groupby. In principle, it has to be computed pairwise and results in a similarity matrix for the groups in groupby. In case of fractions, we have to know what the base is. I am not sure if subgroupby is an option here. as completed
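One possible reading of the pairwise computation, sketched under stated assumptions: the overlap is taken over clonotype sets rather than cell counts, and the base for the fraction is the union of both groups' clonotypes (the Jaccard index). Other bases are equally plausible; the function name and input layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def repertoire_overlap(cells: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Pairwise Jaccard index of clonotype sets, from (group, clonotype) pairs."""
    members: Dict[str, Set[str]] = {}
    for group, clonotype in cells:
        members.setdefault(group, set()).add(clonotype)

    overlap = {}
    # Only the upper triangle is stored; the matrix is symmetric.
    for g1, g2 in combinations(sorted(members), 2):
        shared = members[g1] & members[g2]
        union = members[g1] | members[g2]
        overlap[(g1, g2)] = len(shared) / len(union)
    return overlap


cells = [("s1", "a"), ("s1", "b"), ("s2", "b"), ("s2", "c"), ("s3", "d")]
print(repertoire_overlap(cells))
```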

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:39

marked the task st.pl.sequence_logo(adata, group=celltypes['CD8'], letter=amino_acids, vistype='logo') as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:39

marked the task st.pl.sequence_logos(adata, groupby=celltypes, letter=amino_acids, vistype='logo') as completed

grst commented 4 years ago

In GitLab by @szabogtamas on Feb 12, 2020, 13:47

closed

scverse / scirpy
