LuckyMD closed this issue 5 years ago.
Also, if I see this correctly, then the scripts are not set up to do integration without HVG selection. Is that correct?
I'm very sorry, I forgot that. I will fix that right now.
Why would they not be set up to do integration without hvg selection?
Depends what we say is "integration without HVG selection". I thought we wanted to do 10k HVGs, then integrate, then subset to 2k after integration and calculate the embeddings?
The latter would have to be done as we need to have the same end-point for all integration methods. Then you can run metrics on them.
Alternatively we can also run without subsetting at all. That seems a bit reckless on the methods though. Any opinions on this @mbuttner, @martaint, and @le-ander (we spoke about this today).
Uhm, I can just say that 10k HVGs would not be possible to get in my immune cell dataset (human + mouse), since the number of genes left after orthologous gene selection is ~8,000. Maybe in that case it would be better not to subset at all. Hope it's helpful :)
Best,
Marta
The issue with the 4k HVGs and the 8k cutoff should be fixed now
I think not subsetting at all could be really interesting. If that works, it would be a super clean way of doing things (running HVG selection only after integration).
> The issue with the 4k HVGs and the 8k cutoff should be fixed now
I would replace the `int(max_genes/2)` by `1000`. That seems like a sensible step.
Also, it would be good to output the number of HVGs that are used in the end. If it's fewer than 500, we should probably throw an error?
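A guard along those lines might look like this (sketch only; the helper name is made up, and 500 is just the floor mentioned above):

```python
def check_hvg_count(n_hvgs, minimum=500):
    """Report how many HVGs survived selection and fail loudly
    when there are too few left to integrate on."""
    print(f'{n_hvgs} highly variable genes selected')
    if n_hvgs < minimum:
        raise ValueError(
            f'Only {n_hvgs} HVGs found; at least {minimum} are required.')
    return n_hvgs
```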
Also, you left some git merge stuff in here: https://github.com/theislab/Benchmarking_data_integration/blob/dded6445d4b6dc6ef9abfb5ccb54ff86fc81d378/scIB/preprocessing.py#L182-L187
Okay, let's try no HVG selection. Agreed that it's definitely interesting.
Then we'd need HVG selection and `sc.tl.pca(adata, use_highly_variable=True)` in the integration scripts for post-integration though, no? Or put this in another function for post-processing the datasets?
> I would replace the `int(max_genes/2)` by `1000`. That seems like a sensible step.
The thing is that it then takes way longer to run the function for larger numbers of HVGs.
Then we'd need HVG selection and sc.tl.pca(adata, use_highly_variable=True) in the integration >scripts for post integration though, no? Or put this in another function for post-processing the >datasets?
Yes, I would probably do this post-integration.
Steps of 1000 would still leave us with at most 7 steps. Doesn't sound like a bottleneck when you compare it with the integration itself.
Or actually, the best approach would just be to calculate HVGs once, and then filter yourself on `adata.var['dispersions_norm']`. Then you don't recalculate dispersions every time.
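The compute-once idea could be sketched like this (plain-Python stand-in for `adata.var['dispersions_norm']`; no particular scanpy call implied):

```python
def top_hvgs(dispersions_norm, n_genes):
    """Select the n_genes genes with the highest normalised dispersion
    from a precomputed {gene: dispersion} mapping, so dispersions are
    never recalculated when trying a different cutoff."""
    ranked = sorted(dispersions_norm, key=dispersions_norm.get, reverse=True)
    return ranked[:n_genes]

dispersions = {'GeneA': 0.8, 'GeneB': 2.1, 'GeneC': 1.4}
print(top_hvgs(dispersions, 2))  # ['GeneB', 'GeneC']
```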
> Yes, I would probably do this post-integration.
Then we need a function that tests if we already have an `adata.obsm['X_pca']` or not, and then runs the function. Hmm... or just run it every time for the "no hvg" approaches, and only run it on the cases where we don't get an embedding back for the "hvg" approaches.
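That check could be as simple as the sketch below (a `SimpleNamespace` stands in for the AnnData object, and the PCA computation is passed in as a callable, so nothing scanpy-specific is assumed):

```python
from types import SimpleNamespace

def ensure_embedding(adata, compute_pca):
    """Compute a PCA embedding only if the integration method did not
    already return one in adata.obsm['X_pca']."""
    if 'X_pca' not in adata.obsm:
        adata.obsm['X_pca'] = compute_pca(adata)
    return adata

# minimal stand-in for an AnnData object without an embedding
adata = SimpleNamespace(obsm={})
ensure_embedding(adata, lambda a: 'freshly computed')
```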
Btw, I changed the `hvg` option in `runIntegration` to now take a number of HVGs. It defaults to 2000 genes, but you have to set `-v 0` to not use HVGs. This was necessary so that you can state more than 2k genes for ATAC-seq data. So @mbuttner, you can't run with the `-v` flag anymore unless you put a value after it. Just run without `-v` to get the behaviour of running with `-v` from before. This was backward-breaking.
Hi Malte, I have the following error when I try to run the most recent `runIntegration.py` script:

```
(sc-tutorial) maren.buettner /storage/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scripts $ ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v 2000 -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 13, in runIntegration
    if hvg > 500:
TypeError: '>' not supported between instances of 'str' and 'int'
(sc-tutorial) maren.buettner /storage/groups/ml01/workspace/maren.buettner/data_integration/Benchmarking_data_integration/scripts $ ./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v '2000' -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 13, in runIntegration
    if hvg > 500:
TypeError: '>' not supported between instances of 'str' and 'int'
```
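The traceback suggests `hvg` arrives from the command line as a string. Assuming the script parses its flags with `argparse` (the long name `--hvgs` is invented for this sketch), declaring `type=int` makes the comparison valid:

```python
import argparse

# Hypothetical reconstruction of the flag parsing in runIntegration.py:
# type=int makes comparisons like `hvg > 500` valid, and -v 0 disables
# HVG selection as described above.
parser = argparse.ArgumentParser(description='Run one integration method')
parser.add_argument('-v', '--hvgs', type=int, default=2000,
                    help='number of HVGs to use; 0 means no HVG selection')

args = parser.parse_args(['-v', '2000'])
print(args.hvgs > 500)  # True, instead of a TypeError
```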
Fixed in 396c65e7218fc71a091ceae27a08861bfbf95c6a.
The HVG selection issue is a bit more tricky. The current implementation does not cover the case that there are not enough HVGs at all to fulfil the requirements. In the pancreas data case, I have at most 400 HVGs (or so the HVG selection tells me), and the resulting anndata object created by `scIB.preprocessing.hvg_intersect(adata, batch, adataOut=True, target_genes=hvg, n_stop=max_genes_hvg)` fails the sanity check.
I suggest relaxing the HVG selection such that we use all genes that are highly variable a) in the majority of datasets or b) in n-1 datasets.
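The relaxed rule could be sketched in plain Python (hypothetical helper; each element of `hvg_sets` is the HVG set of one dataset):

```python
from collections import Counter

def relaxed_hvgs(hvg_sets, min_batches):
    """Keep every gene that is highly variable in at least min_batches
    of the per-dataset HVG sets. min_batches = len(hvg_sets) - 1 gives
    the 'n-1 datasets' rule, len(hvg_sets) // 2 + 1 the majority rule."""
    counts = Counter(gene for s in hvg_sets for gene in s)
    return {gene for gene, c in counts.items() if c >= min_batches}
```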
Here's the error message:

```
./runIntegration.py -i /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/human_pancreas_norm.h5ad -o /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama_hvg2000.h5ad -b tech -v 2000 -m scanorama > /storage/groups/ml01/workspace/maren.buettner/data_integration/data/human_pancreas/integrated/human_pancreas_scanorama.out -g 8000
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Trying to set attribute `.var` of view, making a copy.
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 16, in runIntegration
    integrated_tmp = scIB.metrics.measureTM(method, adata, batch)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/metrics.py", line 599, in measureTM
    out = memory_profiler.memory_usage((prof.runcall, args, kwargs), retval=True)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/site-packages/memory_profiler.py", line 336, in memory_usage
    returned = f(*args, **kw)
  File "/home/icb/daniel.strobl/miniconda3/envs/sc-tutorial/lib/python3.7/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/integration.py", line 28, in runScanorama
    checkSanity(adata, batch, hvg)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/utils.py", line 25, in checkSanity
    checkAdata(adata)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/utils.py", line 9, in checkAdata
    raise TypeError('Input is not a valid AnnData object')
TypeError: Input is not a valid AnnData object
```
Error handling in `runIntegration` needs to be better if `hvg_intersect()` throws an error.
You can increase the `max_genes_hvg` value from 8k to 10k. That should work better. The approach you suggest should already be covered by `sc.pp.highly_variable_genes()` with the `batch_key` parameter.
OK, thanks for the feedback. I tried different values of `max_genes_hvg` to see when an invalid anndata object is returned. In my case, the highest value was 7000 (see error message below). For 8000, an invalid adata object was returned.
```
Traceback (most recent call last):
  File "./runIntegration.py", line 59, in <module>
    runIntegration(file, out, run, hvg, batch, max_genes_hvg)
  File "./runIntegration.py", line 14, in runIntegration
    adata = scIB.preprocessing.hvg_intersect(adata, batch, adataOut=True, target_genes=hvg, n_stop=max_genes_hvg)
  File "/home/icb/daniel.strobl/Benchmarking_data_integration/scIB/preprocessing.py", line 228, in hvg_intersect
    raise Exception(f'Only {len(intersect)} HVGs were found in the intersection.\n'
Exception: Only 305 HVGs were found in the intersection.
This is fewer than 500 HVGs set as the minimum.
Consider raising `n_stop` or reducing `n_genes`.
```
This is solved via the `batch_hvg()` function.
@danielStrobl @mumichae It looks like we just keep going until we have 4k HVGs in the intersection in this function: https://github.com/theislab/Benchmarking_data_integration/blob/81aa23057223970e75c193687b1f2f36b6a56478/scIB/preprocessing.py#L208-L213
I thought we agreed on:
If this is how all the methods were integrated so far, we will have to redo everything... @mbuttner
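For reference, a minimal plain-Python reconstruction of the kind of loop linked above, assuming each batch contributes a dispersion-ranked gene list (all names here are hypothetical, not the repo's actual implementation):

```python
def hvg_intersect_sketch(batch_rankings, target_genes=2000, n_stop=8000, step=1000):
    """Widen the per-batch HVG count until the intersection across all
    batches reaches target_genes, giving up at n_stop. This mirrors the
    behaviour under discussion: the loop keeps growing the gene set
    until it hits the target."""
    intersect = set()
    n = target_genes
    while n <= n_stop:
        intersect = set.intersection(*(set(r[:n]) for r in batch_rankings.values()))
        if len(intersect) >= target_genes:
            return intersect
        n += step
    raise ValueError(f'Only {len(intersect)} HVGs were found in the intersection; '
                     'consider raising n_stop or reducing target_genes.')
```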