ventolab / CellphoneDB

CellPhoneDB can be used to search for a particular ligand/receptor, or interrogate your own HUMAN single-cell transcriptomics data.
https://www.cellphonedb.org/
MIT License

Using h5ad file and error "Some cells in meta did not exist in counts,Maybe incorrect file format": Invalid Counts data #144

Closed by saum-kmr 8 months ago

saum-kmr commented 8 months ago

Hello,

I am trying to use CellphoneDB version 4.1 on a dataset of ~72k cells. The dataset was generated with single-cell multiome (RNA + ATAC) across multiple batches. Following the guides, I am using Seurat "RC normalized" counts from the RNA assay and created the h5ad object from them. Following the vignette, running the command "list(adata.obs.index).sort() == list(metadata['barcode_sample']).sort()" returns TRUE.
However, every time I try to run the cpd_analysis_method following the vignette, I get the error shown in the attached screenshot (Screenshot 2023-10-13 at 15 01 09).

I am not sure what the issue with my files is. Could you please suggest what might be wrong?

Many thanks, Saumya

datasome commented 8 months ago

Hi Saumya,

Thank you for using CellphoneDB. As the exception message above says, some barcodes in your meta file are not found in your counts matrix. Your meta file should contain the same cell barcodes as your counts matrix. Could you please check?

Best,

Robert.

saum-kmr commented 8 months ago

Hello,

I generated the meta file and the h5ad file from the same Seurat object. When I read the h5ad file as an AnnData object and compare the barcodes using

"list(adata.obs.index).sort() == list(metadata['barcode_sample']).sort()" it returns TRUE.

I believe this step essentially compares the barcodes in the h5ad object with the barcodes in the metadata, and it found that they all match.

Is that not the case?
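A note on the check quoted above: in Python, `list.sort()` sorts in place and returns `None`, so the expression compares `None == None` and is `True` for any two lists, whether or not the barcodes agree. A minimal sketch of a comparison that does inspect the contents follows. The barcode lists are hypothetical stand-ins for `adata.obs.index` and `metadata['barcode_sample']`; only the first barcode is taken from this thread.

```python
# Hypothetical barcodes for illustration; in practice these would come from
# adata.obs.index and metadata['barcode_sample'].
adata_barcodes = ["pool5A_AAACAGCCAAAGGTAC.1", "pool5A_AAACAGCCAATCCCTT.1"]
meta_barcodes = ["pool5A_AAACAGCCAAAGGTAC-1", "pool5A_AAACAGCCAATCCCTT-1"]

# list.sort() returns None, so this compares None == None and is always True.
vacuous = list(adata_barcodes).sort() == list(meta_barcodes).sort()

# sorted() returns a new sorted list, so this compares the actual barcodes.
real = sorted(adata_barcodes) == sorted(meta_barcodes)

# Set differences show which barcodes are unmatched on each side.
only_in_meta = sorted(set(meta_barcodes) - set(adata_barcodes))
only_in_counts = sorted(set(adata_barcodes) - set(meta_barcodes))

print("vacuous check:", vacuous)
print("real check:", real)
print("only in meta:", only_in_meta)
print("only in counts:", only_in_counts)
```

With these toy lists the vacuous check passes while the real check fails, which is consistent with a check that returns TRUE followed by an error at run time.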

Best, Saumya

datasome commented 8 months ago

Hi Saumya,

As per https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html#input-files - the meta file should have just two columns: Cell and cell_type. I see 'barcode_sample' in your reply above. Could you please fix the column names in your meta file and try again?

Best,

Robert.

saum-kmr commented 8 months ago

Hello,

I was using the following tutorial: https://github.com/ventolab/CellphoneDB/blob/master/notebooks/T01_Method2.ipynb. Since it uses barcode_sample as the name of the "Cell" column in the metadata file, I assumed this was not the cause of the error.

I changed the metadata column names as suggested, and the attached screenshot shows what my files look like (Screenshot 2023-10-18 at 10 02 01). I still get the same error.

Could you please suggest what else I can try?

Best Regards, Saumya

datasome commented 8 months ago

Hi Saumya, My apologies for leading you up a garden path - on second inspection I see the underlying code is more resilient and should be able to cope with barcode_sample as the first column name. You can see the piece of code that throws the original exception in https://github.com/ventolab/CellphoneDB/blob/master/cellphonedb/src/core/preprocessors/counts_preprocessors.py:

if np.any(~meta.index.isin(counts.columns)): ...

hence there must be some cell bar codes in meta file that are not found in the counts file - we just need to get to the bottom of why. You could try to test yourself using the above, or alternatively share the counts and meta files with me and I will take a look? If you put them somewhere accessible, you can send me the link via contact@cellphonedb.org. Either way, do let me know how you got on.
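To test this yourself, the check quoted above can be run on your own tables and extended to list the offending barcodes. The following is a minimal sketch on hypothetical toy data, assuming `meta` is a pandas DataFrame indexed by cell barcode and `counts` is a DataFrame with genes as rows and cells as columns. The barcodes and gene names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy inputs: meta is indexed by barcode, counts has one column per cell.
meta = pd.DataFrame({"cell_type": ["T", "B"]}, index=["cellA-1", "cellB-1"])
counts = pd.DataFrame(
    [[1, 0], [0, 2]],
    index=["GENE1", "GENE2"],
    columns=["cellA-1", "cellB.1"],  # second barcode deliberately differs
)

# Same condition as the quoted line: True where a meta barcode is absent from counts.
mask = ~meta.index.isin(counts.columns)
has_missing = bool(np.any(mask))

# Listing the unmatched barcodes usually makes the cause visible at a glance.
missing = meta.index[mask].tolist()

print("some meta barcodes missing from counts:", has_missing)
print("missing barcodes:", missing)
```

Printing the first few entries of `missing` next to the first few counts column names is often enough to spot a systematic difference in how the barcodes are written.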

Best,

Robert.

dlukacso commented 8 months ago

Dear Saumya,

Sorry for interrupting the thread. I was having the same issue and tracked down the problem. Your problem is your cell names. pandas doesn't like column names with dashes ("-") in them. They get changed to periods. Since the count data is converted to a pandas DataFrame (and transposed so that cells are columns), this ends up renaming your cell names.

For example, take the cell "pool5A_AAACAGCCAAAGGTAC-1". In your metadata file, it is labeled "pool5A_AAACAGCCAAAGGTAC-1". In your counts file, it is labeled "pool5A_AAACAGCCAAAGGTAC.1". These two names are not the same, so the code throws an error. You can fix it by going back to the code that generates your sample files and renaming the cells so that they don't have dashes in their names.
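The mismatch described above can be detected and removed with plain string operations, wherever in the pipeline the periods are introduced. A minimal sketch follows. The helper names `find_dash_dot_mismatches` and `strip_dashes` are hypothetical, the second barcode is invented, and using "_" as the replacement character is an assumption.

```python
def find_dash_dot_mismatches(meta_barcodes, count_barcodes):
    """Return meta barcodes absent from counts whose '-' -> '.' form is present."""
    count_set = set(count_barcodes)
    return [
        b for b in meta_barcodes
        if b not in count_set and b.replace("-", ".") in count_set
    ]


def strip_dashes(barcodes, replacement="_"):
    """Replace '-' in every barcode; refuse if the renaming creates duplicates."""
    renamed = [b.replace("-", replacement) for b in barcodes]
    if len(set(renamed)) != len(set(barcodes)):
        raise ValueError("renaming would make some barcodes collide")
    return renamed


# Hypothetical barcodes; the first pair is the example from this thread.
meta_barcodes = ["pool5A_AAACAGCCAAAGGTAC-1", "pool5A_AAACAGCCAATCCCTT-1"]
count_barcodes = ["pool5A_AAACAGCCAAAGGTAC.1", "pool5A_AAACAGCCAATCCCTT.1"]

suspects = find_dash_dot_mismatches(meta_barcodes, count_barcodes)
clean_meta = strip_dashes(meta_barcodes)

print("dash/period mismatches:", suspects)
print("renamed barcodes:", clean_meta)
```

The renaming should be applied to the cell names in the source object before both the counts and the meta file are exported, so that the two files stay in agreement.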

I'm not sure if you have any other problems, but this at least seems to be one.

Dear Robert,

Perhaps it would be helpful for others to add a note in the tutorials that cell names with dashes cause errors.

datasome commented 8 months ago

Dear David,

Many thanks for your kind help on this issue - much appreciated - I will amend https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html#input-files accordingly.

Best wishes,

Robert.

saum-kmr commented 8 months ago

Dear David and Robert,

Many thanks, yes, this worked. I stripped off all the hyphens and then the tool ran.

Best Regards, Saumya