ludvb / xfuse

Super-resolved spatial transcriptomics by deep data fusion
MIT License
67 stars 13 forks source link

Is it possible for Gene_maps values to use the same scale instead of min-max per gene? #72

Open NicolaasVanRenne opened 1 year ago

NicolaasVanRenne commented 1 year ago

If I understand correctly; the gene maps are minimum and maximum expression for each gene.

However, genes that are hardly expressed in your biological sample will look like they have a strong expression pattern which is overblown (imho).

Would it be possible to have an output where the gene_maps of all genes have the same scale? In this way, genes that are hardly expressed would also show dramatic expression. For example in my tissue, CD1E expression is really low, but it appears enormous in XFuse (with a distribution that doesnt make much sense, its more noise that gets amplified...)

image

Maybe I did sth wrong and this should not be? I expect in this case that the entire tissue would be blackish colour...

ludvb commented 1 year ago

Yes, that's correct. There is currently no option to give the gene maps a common scale. The reason for this is that they are produced one gene at a time, so we don't know the max until all genes have been processed. What you could do possibly is to save the underlying data that the maps are based on by setting writer = "tensor" in the gene_maps section of the config toml file. Then you could create the figures manually. Keep in mind, however, that the tensor file gets very large, so you probably don't want to do this for all genes in the dataset and instead limit the gene set using the gene_regex option.

As an aside, I agree that this gene map looks quite weird. Usually, microstructure (e.g. nuclei) are typically visually distinct. What is the resolution of the H&E after scaling? Usually, I aim for something like 1500 - 2000 px in height/width.

NicolaasVanRenne commented 1 year ago

I tried what you said, and the genes I selected are now in the output folder as a .pt format.

1) How do I turn these .pt files into tiff gene expression maps?

2) Are these gene maps then trained on the entire gene expression data, or only on the genes that I put in the gene_regex?

3) Can you just elaborate a bit on what writer ="tensor" does (and maybe where in the code it is located - would like to understand)

my toml file was something like this;

[xfuse]
network_depth = 6
network_width = 16
min_counts = 50

[expansion_strategy]
type = "DropAndSplit"
[expansion_strategy.DropAndSplit]
max_metagenes = 50

[optimization]
batch_size = 3
epochs = 100000
learning_rate = 0.0003
patch_size = 768

[analyses]
[analyses.metagenes]
type = "metagenes"
[analyses.metagenes.options]
method = "pca"

[analyses.gene_maps]
type = "gene_maps"
[analyses.gene_maps.options]
gene_regex = "^(GENE1|GENE2|GENE3| (and so on until GENE 34))$"
writer = "tensor"

[slides]
[slides.section1]
data = "sections/sample1/data.h5"
[slides.section1.covariates]
section = 1

[slides.section2]
data = "sections/sample2/data.h5"
[slides.section2.covariates]
section = 2

[slides.section3]
data = "sections/sample3/data.h5"
[slides.section3.covariates]
section = 3

[slides.section4]
data = "sections/sample4/data.h5"
[slides.section4.covariates]
section = 4
ludvb commented 1 year ago

Awesome!

How do I turn these .pt files into tiff gene expression maps?

You will have to do this manually. To give you an idea, something like the below script should work (it's untested so it will likely need some slight modifications to work). You need to run it in the same directory as the *.pt files. It first computes the max expression over all maps and then normalizes the gene maps based on that. Note that it could be that the max will be too high and all maps will then be quite dark. In that case, it could make more sense to clip the maps on the 99th percentile or similar instead of the max.

import os
from glob import glob

import matplotlib.cm as cm
import tifffile

def load_gene_map(path):
    gene_map = torch.load(path, map_location="cpu")
    gene_map = gene_map.mean(0)
    return gene_map

max_expression_all = 0
gene_map_files = glob("*.pt")
for path in gene_map_files:
    gene_map = load_gene_map(path)
    max_expression = gene_map.max()
    if max_expression > max_expression_all:
        max_expression_all = max_expression

for path in gene_map_files:
    gene_name,_ = os.path.splitext(os.path.basename(path))
    gene_map = load_gene_map(path)
    gene_map = gene_map / max_expression_all
    gene_map = gene_map.numpy()
    gene_map = cm.inferno(gene_map)
    tifffile.imsave(f"{gene_name}.tif", gene_map)

Are these gene maps then trained on the entire gene expression data, or only on the genes that I put in the gene_regex?

Good question! As long as the model is trained on the entire gene expression data, the latent state and therefore maps will also be informed by other genes.

Can you just elaborate a bit on what writer ="tensor" does (and maybe where in the code it is located - would like to understand)

It sets the gene map writer to the "tensor" writer, which saves the sampled gene maps as pytorch tensors (as opposed to image files saved by the "image" writer). The tensor writer is defined here: https://github.com/ludvb/xfuse/blob/c420abb013c02f44120205ac184c393c14dcd14d/xfuse/analyze/gene_maps.py#L157-L160