Closed zktuong closed 3 years ago
hey @zktuong , thank you for the interest in squidpy! we'd be happy to help you reproduce the results, could you share the cellphonedb command that you use from the terminal ?
also, couple of things I noticed that could be the reason for the mismatch:
pvalues[np.isclose(pvalues, 0.0)] = 0.001 # since i only did 1k iterations
. This could be a source of variation.[ ][CORE][25/02/21-00:13:01][INFO] Using custom database at /nfs/users/nfs_k/kt16/.cpdb/releases/v2.0.0/cellphone.db
. I'd have to double check what's the current version of the database from omnipaththank you!
hi @giovp
Thanks for this.
we'd be happy to help you reproduce the results, could you share the cellphonedb command that you use from the terminal ?
I used:
cellphonedb method statistical_analysis meta.txt rna.h5ad --output-path=out --iterations=1000 --threads=23 --threshold=0.1
this is shown in
[ ][CORE][25/02/21-00:13:33][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:-1 Threads:23 Precision:3
- the pvalues are modified in place with this:
pvalues[np.isclose(pvalues, 0.0)] = 0.001 # since i only did 1k iterations
. This could be a source of variation.
Perhaps; If i don't do the two lines pl.ligrec
plotting will fail.
But those interactions should still be retained since the p-value in base cellphonedb is 0 (technically <0.05)?
The FDR option wasn't specified in my gr.ligrec
command so it's not the issue.
- the colorbar of the mean counts do not match, so I wonder if you used the same raw (or normalized) count data for both methods...
I used normalized counts for both -> it's the same .h5ad
file with only the .X
slot containing log1p values.
- one line printed in stdout from cellphone db is:
[ ][CORE][25/02/21-00:13:01][INFO] Using custom database at /nfs/users/nfs_k/kt16/.cpdb/releases/v2.0.0/cellphone.db
. I'd have to double check what's the current version of the database from omnipath
good idea. Although i think this hasn't changed for a very long time. (IIRC it's been the same version for the past 2 years).
@zktuong I've found a bug which caused some malformed gene names (that would be subsequently filtered out - will be fixed in #310 ), but this doesn't fix the issue . Numerically, the means result should perfectly match the original cpdb implementation, as for the p-values, there's ~0.69 correlation, see the images in https://github.com/theislab/squidpy_reproducibility/pull/10
I haven't checked how the original cpdb database differs from the one omnipath provides, but internally, we just call:
import omnipath
omnipath.clear_cache() # just in case
df = omnipath.interactions.import_intercell_network(interactions_params={"resource": "CellPhoneDB"},
receiver_params={"category": "receptor"},
transmitter_params={"category": "ligand"})
# produces a dataframe of shape (970, 46)
For now, I think the best option would be to supply the list of interactions manually.
Thanks @michalk8 and @giovp.
i wonder if this is related: https://github.com/Teichlab/cellphonedb/pull/288#issuecomment-802419260
.to_numpy() seems to cause shuffle_meta() not actually shuffling in spawned processes. have to revert back to using .values.
There was a very noticeable speed 'increase' when i tried it; instead of 1.5 hours, the original cellphonedb run finished in 5-10 min for the same 200k cell dataset i had in the pdf above. I didn't check the results but was told that it produced very different results in other data and it was reverted.
@zktuong we'll follow this up with omnipath core database as well because it might be that they either don't have the latest cellphonedb database or they do some pre-filtering when it's queried.
i wonder if this is related: Teichlab/cellphonedb#288 (comment) .to_numpy() seems to cause shuffle_meta() not actually shuffling in spawned processes. have to revert back to using .values.
not sure about this sorry, what was that comment referring to?
so the original cellphonedb does the shuffling on a pandas dataframe like so:
def shuffle_meta(meta: pd.DataFrame) -> pd.DataFrame:
"""
Permutates the meta values aleatory generating a new meta file
"""
meta_copy = meta.copy()
np.random.shuffle(meta_copy['cell_type'])
the comment was trying to change it to np.random.shuffle(meta_copy['cell_type'].values)
to prevent a warning message and it turns out that np.random.shuffle(meta_copy['cell_type'].values)
and np.random.shuffle(meta_copy['cell_type'].to_numpy())
is doing different things which @nh3 is saying that the latter fails to shuffle. At least that's how i understood it.
so the original cellphonedb does the shuffling on a pandas dataframe like so:
def shuffle_meta(meta: pd.DataFrame) -> pd.DataFrame: """ Permutates the meta values aleatory generating a new meta file """ meta_copy = meta.copy() np.random.shuffle(meta_copy['cell_type'])
the comment was trying to change it to
np.random.shuffle(meta_copy['cell_type'].values)
to prevent a warning message and it turns out thatnp.random.shuffle(meta_copy['cell_type'].values)
andnp.random.shuffle(meta_copy['cell_type'].to_numpy())
is doing different things which @nh3 is saying that the latter fails to shuffle. At least that's how i understood it.
np.random.shuffle(meta_copy['cell_type'].to_numpy())
does shuffle meta_copy
in normal situation. That behaviour might have something to do with the way parallelisation was implemented in cellphonedb
and in my opinion is likely unrelated to the issue here.
Ok, I'm a bit lost here, we don't shuffle over the pandas df, the shuffling is initialized here: https://github.com/theislab/squidpy/blob/aef5b916f6daa4623b03ac6ddc4538859082d58e/squidpy/gr/_ligrec.py#L795
and then applied here: https://github.com/theislab/squidpy/blob/aef5b916f6daa4623b03ac6ddc4538859082d58e/squidpy/gr/_ligrec.py#L823
the parallelization happens across genes. I assume this is not relevant right @zktuong ?
However, the error that you report is in a PR right? Not on master? so we can assume that results obtained from master are the ground truth. The problem then really seems motivated by a different annotation afaik. Will follow up in pypath repo and report here.
ah ok sorry my bad. sure.
@giovp @zktuong you might be interested in an in-depth explanation @deeenes did a few days back: https://github.com/saezlab/pypath/issues/171#issuecomment-809707239
When trying the changes, I'd use squidpy
from master and clean the omnipath
cache as:
import omnipath as op
op.clear_cache()
thanks @michalk8 for looking into this!
@zktuong as mentioned, the inconsistencies that could arise are due to pre-processing choices of omnipath in building the annotation database (which have nevertheless been mitigated imho, see latest version of comparison here: https://github.com/theislab/squidpy_reproducibility/pull/10 , both re-run with latest master in both squidpy 5a53dfd and cellphonedb 714bbbc ) .
In the same notebook, you can also appreciate that the statistics/numerics highly correlate, and therefore inconsistencies only arise due to the annotation choices.
Let us know if that help and and whether more clarifications are needed.
@zktuong I'm gonna go on and close the issue, for reference:
feel free to reopen if unhappy with the resolution!
Hi,
sorry for this extremely late reply - got distracted by a lot of things. Anyway, just want to say a quick thank you all for all of this and quick + thorough investigation/explanation.
Cheers, Kelvin
Hi, I tried to compare gr.ligrec with base cellphonedb results and noticed that they were not returning the same results and was wondering why. I've attached my notebook as a pdf: testing_cpdb.pdf