Issue in annotating cell types of unlabelled query data by scPoli

First of all, thank you for developing this impressive toolkit. Much appreciated!

I encountered a problem when using scPoli for cell-type annotation. The first time, I trained the model on my reference data and then loaded it for prediction on the query; the annotation looked reasonable, with 8 cell types in total (the cell types I expect in my query data). The second time, I loaded the already-trained model directly and called the classify function, and the result was completely different, with only 2 annotated cell types in total. I double-checked and reran it with slight changes to the parameters, but the issue was not resolved. Could you please help me identify the cause? Here is the main part of my code:

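import scanpy as sc
from scarches.models.scpoli import scPoli

# path_ref, ref_file, path_query, model_dir and process_query are defined elsewhere in my script
print('Reading h5ad...')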
ref_adata = sc.read_h5ad(path_ref+ref_file)
query_adata = process_query(sc.read(path_query))

common_genes = list(set(query_adata.var_names).intersection(ref_adata.var_names))
ref_adata = ref_adata[:,common_genes]
query_adata = query_adata[:,common_genes]

print('Normalizing reference data...')
sc.pp.normalize_total(ref_adata, target_sum=1e6)
sc.pp.log1p(ref_adata, base=2)
print(ref_adata)

ref_adata.obs['batch'] = ref_adata.obs.sample_id
cell_type_key = ['cell_type','supercluster_term']
condition_key = 'batch'

early_stopping_kwargs = {
    "early_stopping_metric": "val_prototype_loss",
    "mode": "min",
    "threshold": 0,
    "patience": 20,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}

scpoli_model = scPoli(
    adata=ref_adata,
    condition_keys=condition_key,
    cell_type_keys=cell_type_key,
    embedding_dims=5,
    recon_loss='nb',
)

print('Start training...')
scpoli_model.train(
    n_epochs=50,
    pretraining_epochs=40,
    early_stopping_kwargs=early_stopping_kwargs,
    eta=5,
)

print('Saving model...')
scpoli_model.save(model_dir, overwrite=True, save_anndata=True)

# Process query data
query_adata.obs['cell_type'] = 'unlabeled'
query_adata.obs['supercluster_term'] = 'unlabeled'
print('Normalizing query data...')
sc.pp.normalize_total(query_adata, target_sum=1e6)
sc.pp.log1p(query_adata, base=2)
print(query_adata)

# Load query data and model
scpoli_query = scPoli.load_query_data(
    adata=query_adata,
    reference_model=model_dir,
    labeled_indices=[],
)

# Train on query
#scpoli_query.train(
#    n_epochs=30,
#    pretraining_epochs=20,
#    eta=10
#)

# Classification
results_dict = scpoli_query.classify(query_adata, scale_uncertainties=True)

# Get latent representation of query data
data_latent = scpoli_query.get_latent(
    query_adata,
    mean=True,
)

adata_latent = sc.AnnData(data_latent)
adata_latent.obs = query_adata.obs.copy()

adata_latent.obs['cell_type_pred'] = results_dict['cell_type']['preds'].tolist()
adata_latent.obs['cell_type_uncert'] = results_dict['cell_type']['uncert'].tolist()
adata_latent.obs['supercluster_term_pred'] = results_dict['supercluster_term']['preds'].tolist()
adata_latent.obs['supercluster_term_uncert'] = results_dict['supercluster_term']['uncert'].tolist()

adata_latent.obs['cell_type_uncert_pass'] = ['T' if x < 0.2 else 'F' for x in adata_latent.obs['cell_type_uncert']]
adata_latent.obs['supercluster_term_uncert_pass'] = ['T' if x < 0.2 else 'F' for x in adata_latent.obs['supercluster_term_uncert']]

print('Predicted cell types of query:')
print(set(adata_latent.obs.cell_type_pred))
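
One detail in the snippet above that may be worth ruling out, offered as an assumption rather than a confirmed cause: `list(set(...))` does not guarantee a stable order for strings across separate Python sessions, because string hashing is randomized per process by default. If the query's gene order in a later session differs from the order used at training time, and the loaded model does not realign columns by gene name (I have not verified whether it does), the inputs would be mismatched. A minimal sketch of a deterministic alternative follows; `ordered_common_genes` is a hypothetical helper, and the gene lists are toy stand-ins for `ref_adata.var_names` and `query_adata.var_names`.

```python
def ordered_common_genes(ref_var_names, query_var_names):
    """Return the genes present in both datasets, in the reference's order.

    Hypothetical helper: the result depends only on the order of
    ref_var_names, so it is identical in every Python session.
    """
    query_set = set(query_var_names)  # the set is used for membership tests only
    return [gene for gene in ref_var_names if gene in query_set]


# Toy stand-ins for ref_adata.var_names and query_adata.var_names
ref_genes = ["GAPDH", "ACTB", "CD3E", "MS4A1"]
query_genes = ["MS4A1", "TP53", "ACTB", "GAPDH"]

common = ordered_common_genes(ref_genes, query_genes)
print(common)  # ['GAPDH', 'ACTB', 'MS4A1']
```

With the real objects this would replace the `common_genes = list(set(...))` line, leaving the two subsetting lines unchanged.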

Another question: how do (1) normalization, log transformation, and subsetting to highly variable genes, and (2) the optional step of training on the unlabelled query data influence the resulting model?
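
Related to point (1), a check that could be run before training, sketched independently of scPoli's API: a negative binomial likelihood, as selected by `recon_loss='nb'`, is defined over counts, whereas `normalize_total` followed by `log1p` generally yields non-integer values. Whether this mismatch matters for my results is an assumption I cannot confirm. `looks_like_counts` is a hypothetical helper, and the small matrices below are toy data, not my dataset.

```python
import math


def looks_like_counts(rows):
    """Return True if every value is a non-negative whole number.

    Hypothetical helper: accepts any iterable of rows (nested lists or a
    dense 2D array); sparse matrices would need densifying first.
    """
    return all(v >= 0 and float(v).is_integer() for row in rows for v in row)


# Toy count matrix (cells x genes) and its log2(1 + x) transform
raw = [[0, 3, 12], [5, 0, 1]]
logged = [[math.log2(1 + v) for v in row] for row in raw]

print(looks_like_counts(raw))     # True
print(looks_like_counts(logged))  # False
```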

Thank you in advance!

theislab / scarches