mahmoodlab / HEST

HEST: Bringing Spatial Transcriptomics and Histopathology together - NeurIPS 2024

Do we need to correct the batch effects of given datasets #43

Open HelloWorldLTY opened 2 months ago

HelloWorldLTY commented 2 months ago

Hi, thanks for your great work. I wonder if we need to correct the batch effects of these spatial transcriptomic data or not. Thanks a lot!

guillaumejaume commented 2 months ago

Hi, it depends on what you want to do with HEST data. What's your use case?

HelloWorldLTY commented 2 months ago

I am interested in the Visium data only. Thanks.

guillaumejaume commented 2 months ago

Visium data integrated into HEST-1k are very diverse: 2 species (mouse and human), multiple diseases, and multiple organs. Batch effect correction should be applied whenever there is some guarantee that it won't significantly alter the biological signal.

To give a better answer, I need a better understanding of your problem statement, e.g., multimodal representation learning, ST prediction from H&E, characterization of morphological correlates of expression changes, etc.

If you want to explore batch effects, we implemented 2 core functions:

HelloWorldLTY commented 2 months ago

Thanks! I will take a look at it!

guillaumejaume commented 2 months ago

@HelloWorldLTY, feel free to document any findings on this GitHub issue.

skambha6 commented 1 month ago

Related to this, I am noticing fairly strong batch effects by sample-of-origin in the H&E patch embeddings from Visium data, even within the same tissue and disease. Is this to be expected, or am I missing a key pre-processing step? I am loading the patches using an H5HESTDataset object and applying only the model-specific eval_transforms (which generally appear to be resizing and ImageNet normalization).

guillaumejaume commented 1 month ago

Batch effects in the H&E images exist. Which patch encoder are you using?

skambha6 commented 1 month ago

I see this with both the Gigapath and UNI encoders.

guillaumejaume commented 1 month ago

In my experience, CONCH is less sensitive to staining variations. Also, keep in mind that the image latent space can express staining variations while still encoding all the relevant biological signal. Depending on the downstream task, it may not be critical.

skambha6 commented 1 month ago

I see. Are there any ways to correct for the staining variations with preprocessing/normalization? It seems that Harmony can remove some of the image batch effects from the embeddings, but not all.
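For reference, the simplest embedding-level baseline can be sketched as below. This is not Harmony; it is plain per-sample mean-centering, shown only to make "removing a sample-of-origin offset from the embeddings" concrete. The function name `center_per_sample` and the toy data are hypothetical, and embeddings are assumed to be plain lists of floats with one sample ID per patch.

```python
def center_per_sample(embeddings, sample_ids):
    """Subtract each sample's mean embedding from that sample's patches.

    embeddings: list of equal-length lists of floats (one per patch).
    sample_ids: list of hashable sample-of-origin labels (one per patch).
    """
    if len(embeddings) != len(sample_ids):
        raise ValueError("embeddings and sample_ids must have the same length")

    # Accumulate a running sum and a patch count for every sample.
    sums, counts = {}, {}
    for vec, sid in zip(embeddings, sample_ids):
        if sid not in sums:
            sums[sid] = [0.0] * len(vec)
            counts[sid] = 0
        for j, value in enumerate(vec):
            sums[sid][j] += value
        counts[sid] += 1

    # Per-sample mean embedding.
    means = {sid: [s / counts[sid] for s in total] for sid, total in sums.items()}

    # Remove the per-sample offset from every patch embedding.
    return [
        [value - mean for value, mean in zip(vec, means[sid])]
        for vec, sid in zip(embeddings, sample_ids)
    ]


# Toy example: two samples whose embeddings differ by a constant shift.
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [14.0, 24.0]]
sample_ids = ["A", "A", "B", "B"]
centered = center_per_sample(embeddings, sample_ids)
```

Note that centering removes any genuine biological difference in the sample means along with the staining offset, so it can discard signal that a downstream task needs.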

guillaumejaume commented 1 month ago

Many approaches exist for stain normalization in computational pathology, e.g., Macenko or Vahadane normalization. However, these can also alter the biological signal from the image. I'd need to better understand your problem statement to provide a more informed answer.
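For context, Macenko and Vahadane normalization both work in optical density (OD) space rather than on raw RGB values. The sketch below shows only that shared first step, the Beer-Lambert conversion, not either full method. The function names and the background intensity of 255 are assumptions made for illustration.

```python
import math


def rgb_to_od(pixel, background=255):
    """Convert an 8-bit RGB pixel to optical density via Beer-Lambert."""
    # Clamp each channel to 1 so a pure-black channel does not hit log(0).
    return tuple(-math.log10(max(c, 1) / background) for c in pixel)


def od_to_rgb(od, background=255):
    """Invert rgb_to_od, rounding and clipping back to the 8-bit range."""
    return tuple(
        min(255, max(0, round(background * 10 ** (-d)))) for d in od
    )


white_od = rgb_to_od((255, 255, 255))          # unstained background
stained_od = rgb_to_od((120, 60, 200))         # arbitrary stained pixel
round_trip = od_to_rgb(stained_od)             # back to RGB
```

In OD space, stain contributions combine approximately linearly, which is why these methods estimate stain vectors there. Rescaling the estimated stain concentrations is also where biological signal such as staining intensity can be altered.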

skambha6 commented 1 month ago

Got it! We were interested in predicting gene expression from the patch embeddings, but it seems from what you're saying that batch effect correction can hurt more than help for this task.

guillaumejaume commented 1 month ago

In HEST-Benchmark we didn't apply additional corrections. I'm sure that performance can be improved. But the big unknown becomes how to ensure good generalization.

skambha6 commented 1 month ago

Okay got it, thank you for the information!