Closed LisaSikkema closed 1 month ago
Hi, thank you for your question. This is because, with our current defaults, the scVI encoder does not receive batch assignments (`encode_covariates=False`), so changing the batch covariate will not affect the latent representation.
In addition, scArches by default freezes pretrained model parameters (i.e. parameters not related to covariates), which is why training the reference model on the query data with different max KL weights does not change the latent representation (see the last section in our user guide).
These should both, however, have an effect on the decoder output (normalized expression).
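To illustrate the point above with a minimal NumPy sketch (toy stand-in weights, not the actual scVI architecture): when the encoder never sees the batch label, relabeling query cells cannot change the latent representation, while a decoder that does condition on batch still changes its output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained" weights standing in for the scVI encoder/decoder.
W_enc = rng.normal(size=(5, 2))      # encoder: genes -> latent
W_dec = rng.normal(size=(2, 5))      # decoder: latent -> genes
batch_offset = {0: 0.0, 1: 1.0}      # decoder's per-batch shift

def encode(x):
    # encode_covariates=False: the batch label is NOT an encoder input.
    return x @ W_enc

def decode(z, batch):
    # The decoder DOES condition on batch.
    return z @ W_dec + batch_offset[batch]

x = rng.normal(size=(3, 5))          # three query cells

z_a = encode(x)  # query labeled as batch 0
z_b = encode(x)  # same cells relabeled as batch 1

print(np.allclose(z_a, z_b))                        # True: latent unchanged
print(np.allclose(decode(z_a, 0), decode(z_b, 1)))  # False: decoder output changes
```

With `encode_covariates=True` at reference-model initialization, `encode` would also take the batch label as input, and relabeling would change the latent representation.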
Thanks a lot for the quick response!
I'm not sure if I understand though: I get that the model parameters for calculating the reference embedding aren't changed, but this should still allow for changing parameters specific for the query right? That's also how I read the user guide you linked to:
> the training of the model with the query data is performed with respect to only the new query-category specific parameters.
As for the batch covariate not affecting the latent representation of the query: how could we still expect the mapping to perform any batch correction in that case?
Is there any way in which I can still change these settings at the stage of mapping, or should it have been done already when the reference model was trained?
> I get that the model parameters for calculating the reference embedding aren't changed, but this should still allow for changing parameters specific for the query right?
Right - since the only query-specific parameters added during transfer learning (by default) are the ones accommodating the new batch covariate categories, these will be updated. Admittedly, this seems a little limited, since we end up only updating the decoder, so I'm not exactly sure why we can expect the model to batch-correct query data well. But this is the default we have, and we are currently not changing it for backwards compatibility.
Before reference model training, you can change this behavior by specifying `encode_covariates=True` during model initialization, with the result that part of the encoder (and thus the latent representation) is updated during additional query training.
After reference model training, `load_query_data` has several options that let you specify additional parameters to freeze/unfreeze, with `unfrozen=True` letting you train all model parameters. For more custom parameter freezing, you would have to manually put zero-grad hooks in the PyTorch module, sorry. Is there a particular part of the model you are interested in updating during query training?
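For reference, zeroing gradients with a hook looks roughly like this (a generic PyTorch sketch with a toy module, not scvi-tools internals; `Tensor.register_hook` is the standard mechanism):

```python
import torch

# Toy two-layer module standing in for a pretrained model.
model = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.Linear(3, 2))

# Freeze the first layer by zeroing its gradients in-place via hooks.
for p in model[0].parameters():
    p.register_hook(lambda grad: torch.zeros_like(grad))

frozen_before = model[0].weight.detach().clone()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()

print(torch.equal(model[0].weight, frozen_before))  # frozen layer unchanged
```

Hooks are useful when you want to freeze only some parameters of a submodule; if you want to freeze an entire submodule, setting `requires_grad_(False)` on its parameters is simpler.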
Okay interesting!
It actually sounds like the current default defeats the purpose of scArches as a whole, with no batch correction being performed during the mapping (at least at the level of the latent embedding). Probably also relevant for @M0hammadL to be aware of.
Maybe good to at least stress this very clearly in the scVI tutorials, to explain that query-to-reference mapping is only really possible later on if `encode_covariates=True` has been set during reference model training.
I'm not so much interested in re-training the reference model, but more in being able to set the way the query model is trained (i.e. including batch already in the encoder, and changing the `max_kl_weight` setting). If I understand the documentation correctly, that is not really possible with the freeze/unfreeze parameters in `load_query_data`, is it?
Good point - probably a good idea to point this out in the tutorials. I wasn't involved in writing this part of scvi-tools, so I'll also double check that this is actually what is happening, but from what I recall, this is the case with our defaults.
Yeah, encoding batch information when the reference model has already been trained without it is not possible yet. For max KL weight, I do believe the changes should be reflected during query training, but let me double check.
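For context on what `max_kl_weight` controls, here is a self-contained sketch of the standard KL warm-up scheme (hypothetical function names; scvi-tools' actual `TrainingPlan` logic may differ in detail): the KL term in the ELBO is scaled by a weight that is annealed up to `max_kl_weight` over the warm-up epochs, so a different `max_kl_weight` changes the training loss and hence the fitted parameters.

```python
def kl_weight(epoch, n_epochs_kl_warmup, min_kl_weight=0.0, max_kl_weight=1.0):
    # Linear warm-up: ramp the KL scaling factor over the warm-up epochs,
    # then hold it at max_kl_weight.
    frac = min(1.0, epoch / n_epochs_kl_warmup)
    return min_kl_weight + frac * (max_kl_weight - min_kl_weight)

def elbo_loss(reconstruction_loss, kl_divergence, epoch,
              n_epochs_kl_warmup=400, max_kl_weight=1.0):
    w = kl_weight(epoch, n_epochs_kl_warmup, max_kl_weight=max_kl_weight)
    return reconstruction_loss + w * kl_divergence

# Same reconstruction and KL terms, different max_kl_weight -> different loss.
print(elbo_loss(10.0, 2.0, epoch=400, max_kl_weight=1.0))  # 12.0
print(elbo_loss(10.0, 2.0, epoch=400, max_kl_weight=0.5))  # 11.0
```

So even with a frozen encoder, a different `max_kl_weight` should change the gradients for whatever parameters are trainable during query training.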
Getting into the thread here: in all scArches tutorials it's highlighted that covariates are encoded. In our hands it doesn't make a difference for most datasets. Out-of-distribution things shouldn't be done (like taking a single-cell model and using a single-nucleus query). However, when reusing a model it generally always makes sense to check that integration worked and that posterior predictive checks look good (in short, that reconstruction loss is similar in query and reference). This will be stressed in our publication encompassing scVI-criticism.
Thanks for your responses!
Right, yes, in the scArches tutorials the covariates are set that way (although not explained, as far as I remember, so it might not be clear to users that those settings are important not to change), but they're not set that way, nor discussed, in the scVI/scANVI tutorials I think. Most people won't go through an scArches tutorial while they're still integrating the data, and will only realize that their model isn't really scArches-friendly once they have already done a lot of quality checks and downstream analysis on their integration and don't want to redo it anymore. I have already experienced this with two large atlas integrations that I worked with in the past weeks.
Also, in the models where this is possible, setting these parameters differently makes quite a large difference for me, especially setting the batch covariate differently, but also e.g. the KL divergence, so I wouldn't say it does not make much of a difference whether covariates are encoded.
We added a comment to the tutorial to highlight this.
Hi! I have been trying to map data to scVI-integrated embeddings using scArches, and have noticed that for the two scVI-based reference models I have used, setting the `max_kl_weight` parameter differently has no effect on the output. The same holds for changing the batch assignment of the query cells (e.g. using sample rather than dataset as batch covariate). I do not see the same problem with an scANVI model I have used. Here's a reproducible example of the first issue:
Which outputs the following (most importantly, 2x "True" at the bottom):
and of the second issue:
which outputs (again most importantly: "True" at the bottom)
Versions:
scArches: 0.5.8
scvi-tools: 0.20.1
The reference model I used is public, I downloaded it with this link: https://zenodo.org/records/10139343/files/pb_tissue_normal_ref_scvi.zip?download=1
This is the query dataset I use in the example: PeerMassague2020_subset_for_testing.h5ad.zip
Any idea why `max_kl_weight` and the batch covariate have zero effect on the output? This should not be the case, as far as I understand.