Did the extended gene dataset (2000 genes) perform better than the 977 gene dataset?

Please take another look at the "Transfer Learning" section of the paper:

As we train chemCPA with a Gaussian likelihood loss, the dataset was first normalised and then log(x + 1)-transformed. Depending on the experiment, we further reduced the number of genes included in the single-cell data. In Section 5.1, we first subsetted both datasets to the same 977 genes which were identified via ensemble gene annotations. For the final experiment in Section 5.2, the considered gene set is increased as we hypothesize that more than the 977 L1000 genes are required to capture the variability within the single-cell data. To assess whether pretraining on L1000 is still beneficial in this scenario, we included 1023 highly variable genes (HVGs) from the sci-Plex3 data. That is, we consider 2000 genes in total.
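
To make the gene-set construction described above concrete, here is a minimal sketch of the idea, not the actual chemCPA preprocessing code. The function names (`normalise_log1p`, `build_gene_set`), the `target_sum` value, and the plain variance ranking used as a stand-in for HVG selection are all assumptions made for illustration; the paper's pipeline may use a different normalisation target and a different HVG method.

```python
import numpy as np


def normalise_log1p(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log(x + 1).

    target_sum is an assumed value, not taken from the paper.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def build_gene_set(expr, gene_names, shared_genes, n_total):
    """Return sorted column indices of the genes to keep.

    All genes in shared_genes (e.g. the L1000 genes) are kept; the
    remaining slots up to n_total are filled with the most variable
    genes outside that set. Variance ranking is a simple stand-in for
    a proper HVG selection method.
    """
    gene_names = np.asarray(gene_names)
    is_shared = np.isin(gene_names, list(shared_genes))
    n_shared = int(is_shared.sum())
    n_available = int((~is_shared).sum())
    n_extra = min(max(n_total - n_shared, 0), n_available)

    # Exclude shared genes from the ranking so they are not picked twice.
    variances = np.where(is_shared, -np.inf, expr.var(axis=0))
    extra_idx = np.argsort(variances)[::-1][:n_extra]
    return np.sort(np.concatenate([np.flatnonzero(is_shared), extra_idx]))


# Toy data: 50 cells, 30 genes, of which the first 10 play the role of
# the genes shared with the bulk dataset.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 30))
gene_names = [f"g{i}" for i in range(30)]
shared = {f"g{i}" for i in range(10)}

expr = normalise_log1p(counts)
keep = build_gene_set(expr, gene_names, shared, n_total=20)
expr_subset = expr[:, keep]
```

With the real data, the shared set would contain the 977 L1000 genes and `n_total` would be 2000, so the remaining 1023 slots are filled from the sci-Plex3 genes.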

The question is not really whether one gene set is better than the other, but whether transfer learning is beneficial in both cases. The extended gene set is naturally harder to predict, since the model has no prior knowledge of the genes that are not present in LINCS.

Citing from the "Extended Gene Set" paragraph of the "Experiment" section:

This is a promising result, as it suggests that the transfer from abundant bulk RNA perturbation screens can be leveraged even in scenarios where the gene sets do not match. Crucially, this enables users to benefit from the proposed transfer learning and chemCPA’s modelling capacity while simultaneously accounting for the special requirements of scRNA-seq data.
