theislab / scib

Benchmarking analysis of data integration tools
MIT License

HVG selection by intersect #18

Closed · LuckyMD closed this 5 years ago

LuckyMD commented 5 years ago

@danielStrobl @mumichae It looks like we just keep going until we have 4k HVGs in the intersection in this function. https://github.com/theislab/Benchmarking_data_integration/blob/81aa23057223970e75c193687b1f2f36b6a56478/scIB/preprocessing.py#L208-L213

I thought we agreed on:

  1. Using 2k HVGs
  2. Never going above 8k HVGs per dataset when searching for the intersection list of 2000.

If this is how all the methods were integrated so far, we will have to redo everything... @mbuttner

LuckyMD commented 5 years ago

Also, if I see this correctly, then the scripts are not set up to do integration without HVG selection. Is that correct?

danielStrobl commented 5 years ago

I'm very sorry, I forgot that. I will fix that right now.

Why would they not be set up to do integration without HVG selection?

LuckyMD commented 5 years ago

Depends on what we call "integration without HVG selection". I thought we wanted to use 10k HVGs, then integrate, then subset to 2k after integration and calculate the embeddings?

The latter would have to be done as we need to have the same end-point for all integration methods. Then you can run metrics on them.

Alternatively, we could also run without subsetting at all. That seems a bit reckless towards the methods, though. Any opinions on this, @mbuttner, @martaint, and @le-ander (we spoke about this today)?

martaint commented 5 years ago

Uhm, I can just say that 10k HVGs would not be possible in my immune cell dataset (human+mouse), since the number of genes left after orthologous gene selection is ~8,000. Maybe in that case it would be better not to subset at all. Hope it's helpful :)

Best,

Marta

danielStrobl commented 5 years ago

The issue with the 4k HVGs and the 8k cutoff should be fixed now.

le-ander commented 5 years ago

I think not subsetting at all could be really interesting. If that works, it would be a super clean way of doing things (running HVG selection only after integration).

LuckyMD commented 5 years ago

> The issue with the 4k HVGs and the 8k cutoff should be fixed now.

https://github.com/theislab/Benchmarking_data_integration/blob/dded6445d4b6dc6ef9abfb5ccb54ff86fc81d378/scIB/preprocessing.py#L218

I would replace the `int(max_genes/2)` step size with 1000. That seems like a sensible step.

Also, it would be good to output the number of HVGs that are used in the end. If it's fewer than 500, we should probably throw an error.
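
For illustration, a minimal sketch of what that stepped intersection could look like (a hypothetical simplification using the target_genes/n_stop parameter names that appear later in this thread; the real hvg_intersect has more logic):

import scanpy as sc

def hvg_intersect_sketch(adata, batch, target_genes=2000, n_stop=8000,
                         step=1000, min_genes=500):
    # Grow the per-batch HVG lists in steps of `step` until their
    # intersection reaches `target_genes`, never computing more than
    # `n_stop` HVGs per batch.
    batches = adata.obs[batch].unique()
    for n_genes in range(target_genes, n_stop + 1, step):
        hvg_sets = []
        for b in batches:
            sub = adata[adata.obs[batch] == b].copy()
            sc.pp.highly_variable_genes(sub, n_top_genes=n_genes)
            hvg_sets.append(set(sub.var_names[sub.var['highly_variable']]))
        intersect = set.intersection(*hvg_sets)
        if len(intersect) >= target_genes:
            return sorted(intersect)
    # report how many HVGs we actually got; fail below the minimum
    if len(intersect) < min_genes:
        raise Exception(f'Only {len(intersect)} HVGs were found in the intersection.')
    return sorted(intersect)

With target_genes=2000 and n_stop=8000 that is at most 7 iterations, but note that the dispersions are recomputed from scratch in every iteration, which comes up again below.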

Also, you left some git merge stuff in here: https://github.com/theislab/Benchmarking_data_integration/blob/dded6445d4b6dc6ef9abfb5ccb54ff86fc81d378/scIB/preprocessing.py#L182-L187

LuckyMD commented 5 years ago

Okay, let's try no HVG selection. Agreed that it's definitely interesting.

Then we'd need HVG selection and sc.tl.pca(adata, use_highly_variable=True) in the integration scripts for post-integration though, no? Or put this in another function for post-processing the datasets?

danielStrobl commented 5 years ago

> I would replace the `int(max_genes/2)` step size with 1000. That seems like a sensible step.

The thing is that it then takes much longer to run the function for larger numbers of HVGs.

> Then we'd need HVG selection and sc.tl.pca(adata, use_highly_variable=True) in the integration scripts for post-integration though, no? Or put this in another function for post-processing the datasets?

Yes, I would probably do this post-integration.

LuckyMD commented 5 years ago

Steps of 1000 would still leave us with at most 7 steps. That doesn't sound like a bottleneck compared to the integration methods themselves.

Or actually, the best approach would just be to calculate HVGs once and then filter yourself on adata.var['dispersions_norm']. That way you don't recalculate the dispersions every time.
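
A sketch of that caching idea (hypothetical helper name; it relies on the dispersions_norm column that sc.pp.highly_variable_genes writes to .var):

import scanpy as sc

def hvg_intersect_cached(adata, batch, target_genes=2000, n_stop=8000, step=1000):
    # Compute the normalized dispersions once per batch and keep the
    # genes ranked by them.
    ranked = []
    for b in adata.obs[batch].unique():
        sub = adata[adata.obs[batch] == b].copy()
        sc.pp.highly_variable_genes(sub)  # fills sub.var['dispersions_norm']
        ranked.append(sub.var['dispersions_norm'].sort_values(ascending=False).index)
    # Widening the per-batch cutoff is now just a slice of the cached
    # ranking -- no dispersions are recomputed.
    for n_genes in range(target_genes, n_stop + 1, step):
        intersect = set.intersection(*(set(r[:n_genes]) for r in ranked))
        if len(intersect) >= target_genes:
            return sorted(intersect)
    return sorted(intersect)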

> Yes, I would probably do this post-integration.

Then we need a function that checks whether we already have an adata.obsm['X_pca'] and runs PCA if we don't. Hmm... or just run it every time for the "no hvg" approaches, and only run it in the cases where we don't get an embedding back for the "hvg" approaches.
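
Something along these lines, as a hypothetical post-processing helper (ensure_embedding is not an existing function in this repo):

import scanpy as sc

def ensure_embedding(adata, n_hvg=2000):
    # If the integration method did not return an embedding, select HVGs
    # and compute a PCA on them, so all methods share the same end-point.
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, n_top_genes=n_hvg)
        sc.tl.pca(adata, use_highly_variable=True)
    return adata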

LuckyMD commented 5 years ago

Btw, I changed the hvg option in runIntegration to now take a number of HVGs. It defaults to 2000 genes, but you have to set -v 0 to not use HVGs. This was necessary so that you can specify more than 2k genes for ATAC-seq data. So @mbuttner, you can't run with the -v flag anymore unless you put a value after it. Just run without -v to get what running with -v did before. This change is backwards-incompatible.
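
In usage terms (a sketch with placeholder file names; the flags are the ones from the commands further down):

./runIntegration.py -i in.h5ad -o out.h5ad -b tech -m scanorama          # 2000 HVGs (new default)
./runIntegration.py -i in.h5ad -o out.h5ad -b tech -m scanorama -v 4000  # custom number, e.g. for ATAC-seq
./runIntegration.py -i in.h5ad -o out.h5ad -b tech -m scanorama -v 0     # no HVG selection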

mbuttner commented 5 years ago

Hi Malte, I have the following error when I try to run the most recent runIntegration.py script:

(sc-tutorial) maren.buettner /storage/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scripts $ ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v 2000 -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 13, in runIntegration
    if hvg > 500:
TypeError: '>' not supported between instances of 'str' and 'int'
(sc-tutorial) maren.buettner /storage/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scripts $ ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v '2000' -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 13, in runIntegration
    if hvg > 500:
TypeError: '>' not supported between instances of 'str' and 'int'

LuckyMD commented 5 years ago

fixed in 396c65e7218fc71a091ceae27a08861bfbf95c6a
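
The commit itself isn't shown here, but the traceback points to the -v value coming out of the argument parser as a string. Assuming argparse is used, a minimal version of the fix is to parse the value as an integer (the long option name here is made up):

import argparse

parser = argparse.ArgumentParser(description='Run a data integration method')
# type=int is the crucial part: without it, args.hvgs is a string and
# comparisons like `hvg > 500` raise the TypeError above
parser.add_argument('-v', '--hvgs', type=int, default=2000,
                    help='number of HVGs to use; 0 disables HVG selection')
args = parser.parse_args()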

mbuttner commented 5 years ago

The HVG selection issue is a bit more tricky. The current implementation does not cover the case where there are not enough HVGs at all to fulfil the requirements. In the pancreas data case, I have at most 400 HVGs (at least according to the HVG selection), and the resulting AnnData object created by scIB.preprocessing.hvg_intersect(adata, batch, adataOut=True, target_genes=hvg, n_stop=max_genes_hvg) fails the sanity check. I suggest relaxing the HVG selection such that we use all genes that are highly variable a) in the majority of datasets or b) in n-1 datasets.
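
A minimal sketch of that relaxed rule (hypothetical helper; it counts in how many batches each gene is flagged as highly variable):

import scanpy as sc

def relaxed_hvg(adata, batch, n_top_genes=2000):
    batches = list(adata.obs[batch].unique())
    counts = {g: 0 for g in adata.var_names}
    for b in batches:
        sub = adata[adata.obs[batch] == b].copy()
        sc.pp.highly_variable_genes(sub, n_top_genes=n_top_genes)
        for g in sub.var_names[sub.var['highly_variable']]:
            counts[g] += 1
    # option b): highly variable in at least n-1 batches;
    # for a majority vote (option a), use len(batches) // 2 + 1 instead
    return [g for g in adata.var_names if counts[g] >= len(batches) - 1]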

Here's the error message:

> ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v 2000 -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out -g 8000
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 599, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True) 
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 28, in runScanorama
    checkSanity(adata, batch, hvg)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/utils.py", line 25, in checkSanity
    checkAdata(adata)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/utils.py", line 9, in checkAdata
    raise TypeError('Input is not a valid AnnData object')
TypeError: Input is not a valid AnnData object

LuckyMD commented 5 years ago

Error handling in runIntegration needs to be better if hvg_intersect() throws an error.

You can increase the max_genes_hvg value from 8k to 10k. That should work better. The approach you suggest should already be available in sc.pp.highly_variable_genes() via the batch_key parameter.
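
For reference, the batch-aware call (in recent scanpy versions) looks like this; adata.var then gains a highly_variable_nbatches column that supports exactly the n-1 relaxation:

import scanpy as sc

# adata: the combined AnnData object with a 'tech' batch column, as in
# the pancreas runs above; one call computes HVGs per batch and combines them
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key='tech')

# 'highly_variable_nbatches' records in how many batches each gene was
# selected, so "HVG in at least n-1 batches" becomes a simple filter
n_batches = adata.obs['tech'].nunique()
relaxed = adata.var_names[adata.var['highly_variable_nbatches'] >= n_batches - 1]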

mbuttner commented 5 years ago

OK, thanks for the feedback. I tried different values of max_genes_hvg to see when an invalid AnnData object is returned. In my case, the highest value that still raised a proper error was 7000 (see the error message below). For 8000, an invalid adata object was returned.

Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 14, in runIntegration
    adata = scIB.preprocessing.hvg_intersect(adata, batch, adataOut=True, target_genes=hvg, n_stop=max_genes_hvg)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/preprocessing.py", line 228, in hvg_intersect
    raise Exception(f'Only {len(intersect)} HVGs were found in the intersection.\n'
Exception: Only 305 HVGs were found in the intersection.
This is fewer than 500 HVGs set as the minimum.
Consider raising `n_stop` or reducing `n_genes`.
LuckyMD commented 5 years ago

This is solved via the batch_hvg() function.