openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[Batch integration feature] HVG conservation easily hacked #744

Closed scottgigante-immunai closed 1 year ago

scottgigante-immunai commented 1 year ago

HVG conservation checks conservation of the top min(500, n_genes/2) genes. This means methods are incentivised to return at most 1000 genes. I can easily bump a method with random performance above the best actual method just by randomly choosing 1000 genes. You could probably do this with a real method and do even better.

>>> import numpy as np
>>> import openproblems
>>> adata = openproblems.tasks.batch_integration_feature.datasets.immune_batch()
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata)
1.0
>>> # "method" with random performance: overwrite every stored nonzero value with uniform noise
>>> adata_bad = adata.copy()
>>> adata_bad.X = adata_bad.X.copy()
>>> adata_bad.X.data = np.random.uniform(0, 1, adata_bad.X.data.shape)
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_bad)
0.03600000000000001
>>> # keep genes with a nonzero value in every batch, then take the first 1000 of them
>>> batches_all_nonzero = np.all([np.any(adata_bad[adata_bad.obs['batch'] == b].X.toarray() > 0, axis=0) for b in adata_bad.obs['batch'].unique()], axis=0)
>>> adata_bad_sub = adata_bad[:,batches_all_nonzero][:,:1000].copy()
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_bad_sub)
0.5237999999999999
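
The jump in the random score is roughly what chance alone would predict. As a back-of-the-envelope sketch (not the metric's actual implementation), assume the score reduces to the overlap between two top-k gene sets drawn from the genes a method returns, with k = min(500, n_genes/2). Two independent random top-k sets then overlap by about k / n_genes on average. The function name and the gene counts below are hypothetical and chosen only for illustration.

```python
import numpy as np


def random_topk_overlap(n_genes, n_hvg=500, n_trials=200, seed=0):
    """Mean overlap fraction of two independent random top-k gene sets.

    Hypothetical helper: it models a method with random performance,
    not the real hvg_conservation metric.
    """
    rng = np.random.default_rng(seed)
    k = min(n_hvg, n_genes // 2)  # the cap described in the issue
    overlaps = []
    for _ in range(n_trials):
        pre = rng.choice(n_genes, size=k, replace=False)   # "pre-integration" HVGs
        post = rng.choice(n_genes, size=k, replace=False)  # "post-integration" HVGs
        overlaps.append(len(np.intersect1d(pre, post)) / k)
    return float(np.mean(overlaps))


# Illustrative gene counts (assumptions, not the size of any real dataset).
for n in (12000, 2000, 1000):
    k = min(500, n // 2)
    print(n, "simulated:", round(random_topk_overlap(n), 3), "k/n:", round(k / n, 3))
```

Under this simplification the chance level is 0.5 for any method that returns 1000 genes or fewer, and it falls as the gene count grows beyond 1000, which is the incentive described above.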

Here it is with ComBat:

>>> adata_combat = openproblems.tasks.batch_integration_feature.methods.combat_hvg_unscaled(adata)
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_combat)
0.42460000000000003
>>> batches_all_nonzero = np.all([np.any(adata_combat[adata_combat.obs['batch'] == b].X.toarray() > 0, axis=0) for b in adata_combat.obs['batch'].unique()], axis=0)
>>> adata_combat_sub = adata_combat[:,batches_all_nonzero][:,:1000].copy()
>>> openproblems.tasks.batch_integration_feature.metrics.hvg_conservation(adata_combat_sub)
0.6061707817240907

Probably the best thing to do here is to explicitly subset to the 1000 most variable pre-integration genes inside the metric, before computing the post-integration variable genes.
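
A minimal sketch of that idea, under several loudly stated assumptions: plain per-gene variance stands in for a proper HVG selection routine, batches are ignored, and `X_pre` and `X_post` are dense arrays whose columns refer to the same genes in the same order. The function names `top_variable` and `hvg_overlap_preselected` are hypothetical and are not part of openproblems.

```python
import numpy as np


def top_variable(X, k):
    """Column indices of the k highest-variance columns of X.

    Assumption: plain variance is a stand-in for real HVG selection.
    """
    return np.argsort(X.var(axis=0))[::-1][:k]


def hvg_overlap_preselected(X_pre, X_post, n_preselect=1000, n_hvg=500):
    """Overlap of top-variable genes, ranked inside a fixed pre-integration pool."""
    # 1. Fix the candidate pool from the pre-integration data only, so the
    #    number of genes a method chooses to return cannot change it.
    pool = top_variable(X_pre, min(n_preselect, X_pre.shape[1]))
    k = min(n_hvg, len(pool) // 2)
    # 2. Rank genes within the pool before and after integration,
    #    mapping back to original column indices.
    pre_top = pool[top_variable(X_pre[:, pool], k)]
    post_top = pool[top_variable(X_post[:, pool], k)]
    # 3. Fraction of pre-integration top genes that are still top genes.
    return len(np.intersect1d(pre_top, post_top)) / k


# Toy usage with synthetic data (cells x genes); the values are illustrative only.
rng = np.random.default_rng(0)
X_pre = rng.normal(size=(50, 3000)) * rng.uniform(0.5, 2.0, size=3000)
X_noise = rng.normal(size=(50, 3000))
print(hvg_overlap_preselected(X_pre, X_pre))    # identical data
print(hvg_overlap_preselected(X_pre, X_noise))  # unrelated data
```

Because the pool is chosen before integration, every method is scored against the same candidate genes, so the chance level no longer depends on how many genes a method returns.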

LuckyMD commented 1 year ago

@danielStrobl We discussed this in the meeting today, and it would make sense to pre-select the top 1000 or 2000 HVGs (from pre-integration batches) before subselecting from those to run this metric.