[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the main branch of openproblems.
The HVG conservation metric checks conservation of the top `min(500, n_genes/2)` genes. This means methods are incentivised to return at most 1000 genes: I can easily bump a method with random performance above the best actual method just by randomly choosing 1000 genes. You could probably do even better with a real method. Here it is with ComBat:
```python
>>> import numpy as np
>>> import openproblems
>>> adata_combat = openproblems.tasks.batch_integration_feature.methods.combat_hvg_unscaled(adata)
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_combat)
0.42460000000000003
>>> # keep only genes expressed in every batch, then take the first 1000
>>> batches_all_nonzero = np.all(
...     [
...         np.any(adata_combat[adata_combat.obs['batch'] == b].X.toarray() > 0, axis=0)
...         for b in adata_combat.obs['batch'].unique()
...     ],
...     axis=0,
... )
>>> adata_combat_sub = adata_combat[:, batches_all_nonzero][:, :1000].copy()
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_combat_sub)
0.6061707817240907
```
Probably the best thing to do here is explicitly subset to the 1000 most variable genes pre-integration in the metric first, before computing the post-integration variable genes.
@danielStrobl We discussed this in the meeting today, and it would make sense to pre-select the top 1000 or 2000 HVGs (from pre-integration batches) before subselecting from those to run this metric.
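A minimal sketch of that proposal, in pure NumPy. This is not the actual metric implementation: `hvg_conservation_fixed`, `n_pool`, and `n_top` are illustrative names, plain per-gene variance stands in for the real HVG selection, and the sketch assumes the pre- and post-integration matrices share the same gene order (a real implementation would match genes by name and compute per-batch HVGs).

```python
import numpy as np

def top_variable_genes(X, k):
    """Indices of the k genes with highest variance across cells."""
    return np.argsort(X.var(axis=0))[::-1][:k]

def hvg_conservation_fixed(X_pre, X_post, n_pool=2000, n_top=500):
    # Hypothetical sketch of the proposed fix:
    # 1. Fix the candidate pool to the top n_pool pre-integration HVGs,
    #    so a method cannot game the metric by returning fewer genes.
    pool = top_variable_genes(X_pre, n_pool)
    X_pre_pool, X_post_pool = X_pre[:, pool], X_post[:, pool]
    # 2. Within that fixed pool, compare the top n_top HVGs before and
    #    after integration via their overlap.
    hvg_pre = set(top_variable_genes(X_pre_pool, n_top))
    hvg_post = set(top_variable_genes(X_post_pool, n_top))
    return len(hvg_pre & hvg_post) / n_top

# Toy usage: an unchanged matrix gives perfect conservation.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 3000)).astype(float)
print(hvg_conservation_fixed(X, X))  # -> 1.0
```

Because the pool is fixed before integration, dropping or subsetting genes can no longer inflate the score; it only removes pool genes from the post-integration ranking.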