Closed. DaianeH closed this issue 4 years ago.
I’m using:
data = pd.read_csv("./data.csv", index_col=0)
to read the expression matrix of primary cells downloaded from https://cells.ucsc.edu/?ds=organoidreportcard . There are nearly 200,000 primary cells in this dataset (11GB), and Python is taking several hours to read it. I have read that pd.read_csv is not recommended when there is a large number of columns in the file (I have 189,410). Do you have any suggestion/recommendation for reading this and similarly big csv files in a format that would still make CELLEX work?
Hi DaianeH,
Thank you for submitting this question. To follow up:
1) What code are you using to read the data?
2) Does the dataframe eventually load, and how long does it take?
3) Could you share some details on the system you are running on?
Best, Tobias
Hi Tobias,
1) Code for reading the data:
import numpy as np
import pandas as pd
import cellex
data = pd.read_csv("exprMatrix.csv", index_col=0)
2) The dataframe does load, but it takes overnight. My question: do you have any suggestion for loading it faster than with pd.read_csv?
3) Details on the system:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping: 4
CPU MHz: 2701.000
CPU max MHz: 2701.0000
CPU min MHz: 1200.0000
BogoMIPS: 5400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
768GB of RAM
Thanks!
Alrighty,
Looks good.
Pandas is indeed slow for this operation. You might try using scanpy as a workaround to load the csv file and then create a dataframe from the AnnData object:
import numpy as np
import pandas as pd
import scanpy as sc
import cellex as cx
import time
start = time.time()
# read data using scanpy
ad = sc.read_text("/scratch/tstannius/organoid.exprMatrix.tsv.gz", delimiter='\t')
meta = pd.read_csv("/scratch/tstannius/organoid.meta.tsv", sep='\t')
# prepare data and metadata for cellex
data = ad.to_df() # cast to df
cellid_map = meta["Cell"].to_dict() # dict for mapping column no to cell_id
data.rename(columns=cellid_map, inplace=True) # rename columns
metadata = pd.Series(meta["Type"].values, index=meta["Cell"]) # create metadata series
# run cellex
eso = cx.ESObject(data=data, annotation=metadata, verbose=True)
eso.compute(verbose=True)
eso.results["esmu"].to_csv("organoid.esmu.csv.gz")
print(time.time()-start)
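# optionally uncomment the lines below to free memory once the results have been written to disk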
# del eso
# del data
# del ad
It took about 40 minutes to run the above code on the system detailed below. N.B. it may also play a role that the data is placed on the /scratch/ partition.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2094.865
BogoMIPS: 4189.73
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 51200K
NUMA node0 CPU(s): 0-9,40-49
NUMA node1 CPU(s): 10-19,50-59
NUMA node2 CPU(s): 20-29,60-69
NUMA node3 CPU(s): 30-39,70-79
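If you end up loading the same matrix repeatedly, another option (just a sketch with placeholder file names, not something CELLEX requires) is to cache the AnnData object as an .h5ad file after the first slow read, so later sessions can skip parsing the text file altogether:
import scanpy as sc
# one-off: slow read from the delimited text file, then write an HDF5-backed .h5ad cache
ad = sc.read_text("exprMatrix.tsv.gz", delimiter='\t')
ad.write("exprMatrix.h5ad")
# later sessions: reload the cache, which is typically much faster than re-parsing the text file
ad = sc.read_h5ad("exprMatrix.h5ad")
data = ad.to_df()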
Please let me know if this approach helps, and thanks again for highlighting this issue. We will consider prioritizing the implementation of alternative input formats such as AnnData objects.
Yes, this is much better :) Thank you!