scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Exporting raw data in CSV #506

Closed tsotnech closed 5 years ago

tsotnech commented 5 years ago

Hi Guys,

This is perhaps more a question than an issue. Is there a way to export raw data in CSV format? If I do this: adata.write_csvs("filename", skip_data=False)

it works perfectly fine

but with adata.raw.write_csvs("filename", skip_data=False)

I get this error AttributeError: 'Raw' object has no attribute 'write_csvs'

Thanks,

LuckyMD commented 5 years ago

Hi,

You should be able to convert the adata.raw.X object into a pandas dataframe and use the pd.DataFrame.to_csv() function. You will have to write the raw csvs separately for adata.raw.X, adata.obs and adata.raw.var though (the raw object shares its observations with adata, so there is no separate adata.raw.obs). The last two are already dataframes, so no need to convert.

So like this:

pd.DataFrame(adata.raw.X).to_csv(filename_raw_x)
adata.obs.to_csv(filename_raw_obs)
adata.raw.var.to_csv(filename_raw_var)

tsotnech commented 5 years ago

Thanks for the reply. Unfortunately, the pandas DataFrame conversion gives me this error:

pd.DataFrame(adata.raw.X).to_csv(filename_raw_x)

ValueError: DataFrame constructor not properly called!

LuckyMD commented 5 years ago

Hi, sorry, I guessed the convention a bit ;). I think it should be something like: pd.DataFrame(data=adata.raw.X, index=adata.raw.obs_names, columns=adata.raw.var_names)
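As a rough sketch of that labelled construction: the snippet below uses a tiny SciPy CSR matrix as a stand-in for adata.raw.X, with made-up values and hypothetical cell and gene names (none of them come from this thread). It assumes a pandas version that provides pd.DataFrame.sparse.from_spmatrix, which builds the frame without densifying the matrix first.

```python
import pandas as pd
from scipy import sparse

# Stand-in for adata.raw.X: 2 cells x 3 genes, mostly zeros (made-up values)
raw_x = sparse.csr_matrix([[0.0, 2.5, 0.0],
                           [1.0, 0.0, 0.0]])
cells = ["cell_0", "cell_1"]             # hypothetical obs_names
genes = ["gene_a", "gene_b", "gene_c"]   # hypothetical var_names

# Build a labelled DataFrame whose columns stay sparse
df = pd.DataFrame.sparse.from_spmatrix(raw_x, index=cells, columns=genes)

# One row per cell, one column per gene; with no path given, to_csv returns a string
csv_text = df.to_csv()
print(csv_text)
```

With a real AnnData object, the index and columns would presumably be adata.raw.obs_names and adata.raw.var_names, and a file path would be passed to to_csv.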

tsotnech commented 5 years ago

@LuckyMD Thanks a lot, that option did work. However, I might have an issue somewhere else: when I run this command, it creates a data frame with different values per cell and gene. Then I went to check my raw.X, and if I do

print(adata.raw.X) I get values like this:

  (0, 2005) 2.724785
  (0, 2004) 0.8859608
  (0, 2003) 2.1791992
  (0, 2001) 3.498613
  (0, 2000) 2.855959
  (0, 1985) 0.8859608
  (0, 1984) 0.5380458
  (0, 1974) 0.5380458
  (0, 1950) 1.3482361

But if I look at adata.X, it already holds the scaled values and looks like this:

[[-0.33361623 -0.39783627  0.3157311  ...  0.74563605  1.3691179
  -0.5147692 ]
 [-0.2754761  -0.3423106   0.57043296 ... -0.2636034   1.3486573
   2.9965923 ]

I saved the raw slot right before scaling the data, so I expected it to hold just the normalized values per cell/gene, similar to adata.X.

maximilianh commented 5 years ago

You can also use the cellbrowser export function in tools. It will create two tab-sep files, one with the raw matrix and one with the meta data.

LuckyMD commented 5 years ago

@tsotnech I don't entirely understand your problem. If you use sc.pp.scale() on your anndata object you will scale the gene expression data to have a mean of 0 and a variance of 1 in adata.X while adata.raw.X remains unchanged. So then your data will look different.
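To illustrate why the two look different, here is a minimal sketch in plain NumPy rather than scanpy. The matrix values are made up, raw_copy only plays the role of adata.raw.X, and using the sample variance (ddof=1) is an assumption that may not match what the library does internally.

```python
import numpy as np

# Stand-in for normalized expression: 3 cells x 2 genes (made-up values)
x = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [3.0, 6.0]])

# Plays the role of adata.raw.X: a copy saved before scaling
raw_copy = x.copy()

# Scale each gene (column) to mean 0 and unit variance
scaled = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

print(scaled)     # every column now contains negative values
print(raw_copy)   # unchanged and still non-negative
```

The zeros in the first gene end up as negative numbers after centring, which is why the scaled matrix has negative entries while the saved copy does not.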

Also, it looks like @maximilianh's suggestion is a lot nicer than the code I provided.

tsotnech commented 5 years ago

@LuckyMD @maximilianh Thanks guys for the reply. Sorry, I'm just used to the Seurat setup and kind of got lost. Long story short, my issue is negative values: after data processing and scaling I have negative values in the expression matrix, which throws off my downstream analysis. But before scaling, the adata.X format looks completely different (as I mentioned in my previous post). I just want to have a gene/cell matrix.

If I export using the cell browser tool, I get the same values as the processed adata.X. If I do adata.to_df().to_csv('./adata.csv', sep=',')

or if I do

import scanpy.external as sce
sce.exporting.cellbrowser(adata, './test', 'adata', embedding_keys=None, annot_keys=['louvain'], cluster_field='louvain')

it generates exactly the same expression matrix. I don't really see the raw value matrix.

LuckyMD commented 5 years ago

Hi again,

As I was trying to explain before, the sc.pp.scale() function scales gene expression values to be centred at 0 and have a variance of 1. That will necessarily produce negative expression values. Maybe the function you were intending to use is sc.pp.normalize_per_cell()? That would normalize the gene expression values to give counts per million (or counts per some other constant).
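To make the difference concrete, here is a small sketch of the idea behind per-cell normalization, written in plain NumPy rather than with the scanpy function. The counts are made up, and the target of one million is just an assumed constant for illustration.

```python
import numpy as np

# Made-up raw counts: 2 cells x 3 genes
counts = np.array([[1.0, 3.0, 0.0],
                   [10.0, 0.0, 10.0]])
target_sum = 1e6  # assumed constant: counts per million

# Divide each cell (row) by its total count, then rescale to the target
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals * target_sum

print(normalized)              # non-negative, zeros stay zero
print(normalized.sum(axis=1))  # every cell now sums to target_sum
```

Unlike scaling, this keeps all values non-negative and leaves zeros as zeros, so the matrix stays sparse.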

tsotnech commented 5 years ago

@LuckyMD The raw data before scaling has all of these "coordinates", e.g. (0, 2005), which basically say what value is assigned to which cell and gene, right? When I try to export the raw slot, all of these "coordinates" get exported along with the values in a weird way. After scaling, however, those coordinates are gone: as I showed before, I get scaled values, and I understand they can be negative, which is totally fine. I just want to extract the raw or pp.normalized values without those "coordinates".

LuckyMD commented 5 years ago

I think I understand what you mean. The "coordinate" representation you are referring to is a sparse matrix format. If you don't want that representation, you can densify your data using the function: adata.X.toarray(). I think you should be able to do the same with adata.raw.X as well.

Scaling removes the sparse matrix formatting, as the matrix is no longer sparse after scaling. In other words, scaling replaces (nearly) all of the 0s, so you get a dense format.
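A small sketch of the two representations, using a made-up SciPy CSR matrix as a stand-in for the expression matrix (it is not data from this thread):

```python
import numpy as np
from scipy import sparse

# Made-up 2 cells x 3 genes matrix, mostly zeros
dense = np.array([[0.0, 0.0, 2.7],
                  [0.9, 0.0, 0.0]])
m = sparse.csr_matrix(dense)

print(m)            # lists only the stored entries as (row, column) value
print(m.toarray())  # the full 2 x 3 array, zeros included

# Centring each column turns most zeros into non-zero values
centered = dense - dense.mean(axis=0)
print(np.count_nonzero(centered), "non-zero entries after centring vs", m.nnz, "before")
```

The sparse matrix stores only 2 entries, while the centred matrix has 4 non-zero entries out of 6, so a sparse format no longer saves much.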

I hope that is what you're looking for.

maximilianh commented 5 years ago

I'll modify the cell browser export function to export the raw matrix by default. I think this has come up before and it's pretty easy to do.

tsotnech commented 5 years ago

@LuckyMD @maximilianh Thanks guys for your help, and sorry for the misunderstanding. I think I wasn't explaining my issue correctly.

In the end, this worked perfectly well to export the raw data matrix:

import pandas as pd
# Densify the sparse raw matrix, then label rows by cell and columns by gene
t = adata.raw.X.toarray()
pd.DataFrame(data=t, index=adata.obs_names, columns=adata.raw.var_names).to_csv('adata_raw_x.csv')