scverse / squidpy

Spatial Single Cell Analysis in Python
https://squidpy.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

gr.ligrec with cellphonedb mode not producing the same results as base cellphonedb #308

Closed zktuong closed 3 years ago

zktuong commented 3 years ago

Hi, I tried to compare gr.ligrec with base cellphonedb results and noticed that they were not returning the same results and was wondering why. I've attached my notebook as a pdf: testing_cpdb.pdf

giovp commented 3 years ago

hey @zktuong , thank you for the interest in squidpy! we'd be happy to help you reproduce the results, could you share the cellphonedb command that you use from the terminal ?

also, a couple of things I noticed that could be the reason for the mismatch:

  • the p-values are modified in place with pvalues[np.isclose(pvalues, 0.0)] = 0.001, which could be a source of variation
  • the colorbar of the mean counts does not match, so I wonder if you used the same raw (or normalized) count data for both methods
  • one line printed to stdout from cellphonedb is: [ ][CORE][25/02/21-00:13:01][INFO] Using custom database at /nfs/users/nfs_k/kt16/.cpdb/releases/v2.0.0/cellphone.db. I'd have to double-check what the current version of the database from omnipath is

thank you!

zktuong commented 3 years ago

hi @giovp

Thanks for this.

we'd be happy to help you reproduce the results, could you share the cellphonedb command that you use from the terminal ?

I used:

cellphonedb method statistical_analysis meta.txt rna.h5ad --output-path=out --iterations=1000 --threads=23 --threshold=0.1

this is confirmed by the log line: [ ][CORE][25/02/21-00:13:33][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:-1 Threads:23 Precision:3

  • the pvalues are modified in place with this: pvalues[np.isclose(pvalues, 0.0)] = 0.001 # since i only did 1k iterations. This could be a source of variation.

Perhaps; if I don't apply those two lines, pl.ligrec plotting will fail. But those interactions should still be retained, since the p-value in base cellphonedb is 0 (technically < 0.05)? The FDR option wasn't specified in my gr.ligrec command, so it's not the issue.

  • the colorbar of the mean counts does not match, so I wonder if you used the same raw (or normalized) count data for both methods...

I used normalized counts for both -> it's the same .h5ad file with only the .X slot containing log1p values.

  • one line printed in stdout from cellphone db is: [ ][CORE][25/02/21-00:13:01][INFO] Using custom database at /nfs/users/nfs_k/kt16/.cpdb/releases/v2.0.0/cellphone.db . I'd have to double check what's the current version of the database from omnipath

Good idea, although I think this hasn't changed for a very long time (IIRC it's been the same version for the past two years).
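As an aside on the p-value clamping discussed above: with 1000 permutations, the smallest resolvable nonzero empirical p-value is 1/1000, so exact zeros (meaning "no permutation reached the observed statistic") are conventionally floored at that value. A minimal sketch of the idea, not squidpy's or cellphonedb's actual code:

```python
import numpy as np

n_perms = 1000
# illustrative p-values; 0.0 means no permutation reached the observed value
pvalues = np.array([0.0, 0.004, 0.25])

# floor exact zeros at the test's resolution limit, 1 / n_perms
pvalues[np.isclose(pvalues, 0.0)] = 1.0 / n_perms
```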

michalk8 commented 3 years ago

@zktuong I've found a bug which caused some malformed gene names (which would subsequently be filtered out; will be fixed in #310), but this doesn't fix the issue. Numerically, the means should match the original cpdb implementation exactly; as for the p-values, there's ~0.69 correlation, see the images in https://github.com/theislab/squidpy_reproducibility/pull/10
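A quick way to quantify that kind of agreement (the ~0.69 figure) is a correlation over aligned p-value vectors from the two tools; the arrays below are placeholders, not real results:

```python
import numpy as np

rng = np.random.default_rng(0)
pv_cpdb = rng.uniform(size=50)                         # placeholder p-values
pv_squidpy = pv_cpdb + rng.normal(scale=0.1, size=50)  # noisy stand-in

# Pearson correlation between the two implementations' p-values
r = np.corrcoef(pv_cpdb, pv_squidpy)[0, 1]
```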

I haven't checked how the original cpdb database differs from the one omnipath provides, but internally, we just call:

import omnipath
omnipath.clear_cache()  # just in case

df = omnipath.interactions.import_intercell_network(interactions_params={"resource": "CellPhoneDB"},
                                                    receiver_params={"category": "receptor"},
                                                    transmitter_params={"category": "ligand"})
# produces a dataframe of shape (970, 46)

For now, I think the best option would be to supply the list of interactions manually.
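Supplying the interactions manually could look like the sketch below; the gene pairs are made-up examples, and the (commented) gr.ligrec call assumes an AnnData object `adata` annotated with a "cell_type" column:

```python
import pandas as pd

# a manual interaction list: gr.ligrec accepts a dataframe with
# "source"/"target" columns (the pairs here are illustrative only)
interactions = pd.DataFrame(
    {"source": ["TGFB1", "CXCL12"], "target": ["TGFBR1", "CXCR4"]}
)

# then, assuming an AnnData object `adata` with a "cell_type" annotation:
# import squidpy as sq
# res = sq.gr.ligrec(adata, cluster_key="cell_type",
#                    interactions=interactions, copy=True)
```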

zktuong commented 3 years ago

Thanks @michalk8 and @giovp.

I wonder if this is related: https://github.com/Teichlab/cellphonedb/pull/288#issuecomment-802419260. .to_numpy() seems to cause shuffle_meta() to not actually shuffle in spawned processes; they had to revert to using .values.

There was a very noticeable speed 'increase' when I tried it: instead of 1.5 hours, the original cellphonedb run finished in 5-10 min for the same 200k-cell dataset in the pdf above. I didn't check the results, but was told that it produced very different results on other data, so it was reverted.

giovp commented 3 years ago

@zktuong we'll follow this up with the omnipath core database as well, because it might be that they either don't have the latest cellphonedb database or they do some pre-filtering when it's queried.

I wonder if this is related: Teichlab/cellphonedb#288 (comment). .to_numpy() seems to cause shuffle_meta() to not actually shuffle in spawned processes; they had to revert to using .values.

not sure about this, sorry; what was that comment referring to?

zktuong commented 3 years ago

so the original cellphonedb does the shuffling on a pandas dataframe like so:

import numpy as np
import pandas as pd

def shuffle_meta(meta: pd.DataFrame) -> pd.DataFrame:
    """
    Randomly permutes the meta values, generating a new meta file
    """
    meta_copy = meta.copy()
    np.random.shuffle(meta_copy['cell_type'])
    return meta_copy

the comment was trying to change it to np.random.shuffle(meta_copy['cell_type'].values) to prevent a warning message, and it turns out that np.random.shuffle(meta_copy['cell_type'].values) and np.random.shuffle(meta_copy['cell_type'].to_numpy()) do different things; @nh3 is saying that the latter fails to shuffle. At least that's how I understood it.

nh3 commented 3 years ago

so the original cellphonedb does the shuffling on a pandas dataframe like so:

import numpy as np
import pandas as pd

def shuffle_meta(meta: pd.DataFrame) -> pd.DataFrame:
    """
    Randomly permutes the meta values, generating a new meta file
    """
    meta_copy = meta.copy()
    np.random.shuffle(meta_copy['cell_type'])
    return meta_copy

the comment was trying to change it to np.random.shuffle(meta_copy['cell_type'].values) to prevent a warning message, and it turns out that np.random.shuffle(meta_copy['cell_type'].values) and np.random.shuffle(meta_copy['cell_type'].to_numpy()) do different things; @nh3 is saying that the latter fails to shuffle. At least that's how I understood it.

np.random.shuffle(meta_copy['cell_type'].to_numpy()) does shuffle meta_copy in normal situations. That behaviour might have something to do with the way parallelisation was implemented in cellphonedb, and in my opinion it is likely unrelated to the issue here.
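A small single-process check of that point (assuming default pandas copy semantics, i.e. Copy-on-Write not enabled): the array returned by .to_numpy() is a view of the column, so shuffling it in place also reorders the DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cell_type": list("ABCDEFGH")})
before = df["cell_type"].tolist()

# .to_numpy() returns a view here, so the shuffle mutates df itself
np.random.shuffle(df["cell_type"].to_numpy())
after = df["cell_type"].tolist()

# `after` contains the same labels as `before`, generally in a new order
```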

giovp commented 3 years ago

Ok, I'm a bit lost here; we don't shuffle over the pandas df. The shuffling is initialized here: https://github.com/theislab/squidpy/blob/aef5b916f6daa4623b03ac6ddc4538859082d58e/squidpy/gr/_ligrec.py#L795

and then applied here: https://github.com/theislab/squidpy/blob/aef5b916f6daa4623b03ac6ddc4538859082d58e/squidpy/gr/_ligrec.py#L823

the parallelization happens across genes. I assume this is not relevant, right @zktuong?

However, the error that you report is in a PR, right? Not on master? So we can assume that the results obtained from master are the ground truth. The problem then really seems to be caused by a different annotation, afaik. Will follow up in the pypath repo and report here.
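The label-shuffling scheme described above (permute the cluster labels once per iteration, recompute the per-cluster statistic, count exceedances) can be sketched independently of squidpy; everything below is a toy stand-in, not squidpy's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
expr = rng.gamma(2.0, size=200)           # one gene's expression, 200 cells
labels = np.array([0] * 100 + [1] * 100)  # two toy clusters

def cluster_mean(expr, labels, cluster):
    # mean expression restricted to one cluster
    return expr[labels == cluster].mean()

observed = cluster_mean(expr, labels, 0)

n_perms = 1000
hits = sum(
    cluster_mean(expr, rng.permutation(labels), 0) >= observed
    for _ in range(n_perms)
)
pval = hits / n_perms  # empirical p-value; floored at 1/n_perms in practice
```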

zktuong commented 3 years ago

ah ok sorry my bad. sure.

michalk8 commented 3 years ago

@giovp @zktuong you might be interested in an in-depth explanation @deeenes did a few days back: https://github.com/saezlab/pypath/issues/171#issuecomment-809707239

When trying the changes, I'd use squidpy from master and clear the omnipath cache first:

import omnipath as op
op.clear_cache()

giovp commented 3 years ago

thanks @michalk8 for looking into this!

@zktuong as mentioned, the inconsistencies that can arise are due to pre-processing choices omnipath makes in building the annotation database (which have nevertheless been mitigated imho; see the latest version of the comparison here: https://github.com/theislab/squidpy_reproducibility/pull/10 , re-run with the latest master of both squidpy 5a53dfd and cellphonedb 714bbbc).

In the same notebook, you can also see that the statistics/numerics correlate highly, and therefore the inconsistencies arise only from the annotation choices.

Let us know if that helps and whether more clarifications are needed.
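One way to isolate such annotation differences before comparing the numbers is to intersect the (source, target) pairs of the two databases; a hypothetical sketch with made-up pairs:

```python
import pandas as pd

# made-up interaction tables standing in for the two annotation sources
db_omnipath = pd.DataFrame({"source": ["TGFB1", "CXCL12", "IL6"],
                            "target": ["TGFBR1", "CXCR4", "IL6R"]})
db_cpdb = pd.DataFrame({"source": ["TGFB1", "CXCL12"],
                        "target": ["TGFBR1", "CXCR4"]})

merged = db_omnipath.merge(db_cpdb, on=["source", "target"],
                           how="outer", indicator=True)
shared = merged[merged["_merge"] == "both"]        # compare stats on these
omnipath_only = merged[merged["_merge"] == "left_only"]
```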

giovp commented 3 years ago

@zktuong I'm going to go ahead and close the issue, for reference:

feel free to reopen if unhappy with the resolution!

zktuong commented 3 years ago

Hi,

sorry for this extremely late reply, I got distracted by a lot of things. Anyway, I just want to say a quick thank you to all of you for the quick and thorough investigation/explanation.

Cheers, Kelvin