Closed. DaianeH closed this issue 4 years ago.
I’m using:
data = pd.read_csv("./data.csv", index_col=0)
to read the expression matrix of primary cells downloaded from https://cells.ucsc.edu/?ds=organoidreportcard . There are nearly 200,000 primary cells in this dataset (11GB), and Python is taking several hours to read it. I have read that pd.read_csv is not recommended when there is a large number of columns in the file (I have 189,410). Do you have any suggestion/recommendation for reading this and similarly big csv files in a format that would still make CELLEX work?
Hi DaianeH,
Thank you for submitting this question. To follow up:
1) What code are you using to read the data?
2) Does the dataframe eventually load, and how long does it take?
3) Could you share some details on the system you are running on?
Best, Tobias
Hi Tobias,
1) Code for reading the data:
import numpy as np
import pandas as pd
import cellex
data = pd.read_csv("exprMatrix.csv", index_col=0)
2) The dataframe does load, but it takes overnight. My question: do you have any suggestion for loading it faster than with pd.read_csv?
3) Details on the system:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping: 4
CPU MHz: 2701.000
CPU max MHz: 2701.0000
CPU min MHz: 1200.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
768GB of RAM
Thanks!
Alrighty,
Looks good.
Pandas is indeed slow for this operation. You might try using scanpy as a workaround to load the csv file and then create a dataframe from the AnnData object:
import numpy as np
import pandas as pd
import scanpy as sc
import cellex as cx
import time
start = time.time()
# read data using scanpy
ad = sc.read_text("/scratch/tstannius/organoid.exprMatrix.tsv.gz", delimiter='\t')
meta = pd.read_csv("/scratch/tstannius/organoid.meta.tsv", sep='\t')
# prepare data and metadata for cellex
data = ad.to_df() # cast to df
cellid_map = meta["Cell"].to_dict() # dict for mapping column no to cell_id
data.rename(columns=cellid_map, inplace=True) # rename columns
metadata = pd.Series(meta["Type"].values, index=meta["Cell"]) # create metadata series
# run cellex
eso = cx.ESObject(data=data, annotation=metadata, verbose=True)
eso.compute(verbose=True)
eso.results["esmu"].to_csv("organoid.esmu.csv.gz")
print(time.time()-start)
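# optionally uncomment the lines below to free memory once the results have been written to disk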
# del eso
# del data
# del ad
It took about 40 minutes to run the above code on the system detailed below. N.B. it may also play a role that the data is placed on the /scratch/ partition.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2094.865
BogoMIPS: 4189.73
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-9,40-49
NUMA node1 CPU(s): 10-19,50-59
NUMA node2 CPU(s): 20-29,60-69
NUMA node3 CPU(s): 30-39,70-79
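If you end up loading the same matrix repeatedly, another option (just a sketch with placeholder file names, not something CELLEX requires) is to cache the AnnData object as an .h5ad file after the first slow read, so later sessions can skip parsing the text file altogether:
import scanpy as sc
# one-off: slow read from the delimited text file, then write an HDF5-backed .h5ad cache
ad = sc.read_text("exprMatrix.tsv.gz", delimiter='\t')
ad.write("exprMatrix.h5ad")
# later sessions: reload the cache, which is typically much faster than re-parsing the text file
ad = sc.read_h5ad("exprMatrix.h5ad")
data = ad.to_df()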
Please let me know if this approach helps, and thanks again for highlighting this issue. We will consider prioritizing the implementation of alternative input formats such as AnnData objects.
Yes, this is much better :) Thank you!