perslab / CELLEX

CELLEX (CELL-type EXpression-specificity)
GNU General Public License v3.0

Reading in data with pandas in a server setting is slow. How can I speed this up? #16

Closed DaianeH closed 4 years ago

DaianeH commented 4 years ago

I’m using:

data = pd.read_csv("./data.csv", index_col=0)

to read the expression matrix of primary cells downloaded from https://cells.ucsc.edu/?ds=organoidreportcard . There are nearly 200,000 primary cells in this dataset (11 GB), and Python takes several hours to read it. I have read that pd.read_csv is not recommended when a file has a very large number of columns (I have 189,410). Do you have any suggestions for reading this and similarly big csv files in a format that would still work with CELLEX?

tstannius commented 4 years ago

Hi DaianeH,

Thank you for submitting this question. To follow up:

1. Could you share the code you are using to read the data?
2. Does the dataframe load at all, or does the read never finish?
3. Could you share some details on the system you are running on (CPU, RAM)?

Best, Tobias

DaianeH commented 4 years ago

Hi Tobias,

1) Code for reading the data:

import numpy as np 
import pandas as pd
import cellex
data = pd.read_csv("exprMatrix.csv", index_col=0)

2) The dataframe loads, but it takes overnight. My question is: do you have any suggestion for loading it faster than with pd.read_csv?

3) Details on the system:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               2701.000
CPU max MHz:           2701.0000
CPU min MHz:           1200.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
768 GB of RAM

Thanks!

tstannius commented 4 years ago

Alrighty,

  1. Looks good.

  2. Pandas is indeed slow for this operation. You might try scanpy as a workaround: load the matrix with scanpy and then create a dataframe from the resulting AnnData object (here I use the tsv.gz version of the same dataset):

import numpy as np
import pandas as pd
import scanpy as sc
import cellex as cx
import time

start = time.time()

# read data using scanpy
ad = sc.read_text("/scratch/tstannius/organoid.exprMatrix.tsv.gz", delimiter='\t')
meta = pd.read_csv("/scratch/tstannius/organoid.meta.tsv", sep='\t')

# prepare data and metadata for cellex
data = ad.to_df() # cast to df
cellid_map = meta["Cell"].to_dict() # dict for mapping column no to cell_id
data.rename(columns=cellid_map, inplace=True) # rename columns
metadata = pd.Series(meta["Type"].values, index=meta["Cell"]) # create metadata series

# run cellex
eso = cx.ESObject(data=data, annotation=metadata, verbose=True)
eso.compute(verbose=True)
eso.results["esmu"].to_csv("organoid.esmu.csv.gz")

print(time.time()-start)
# del eso
# del data
# del ad

It took about 40 mins to run the above code on the system detailed below. N.B. It may also play a role that the data is placed on the /scratch/ partition. If you would rather stay with the original csv, see the pyarrow sketch after the system details.

  3. Your system should be more than capable of running this analysis. For comparison, here is the system I ran my analysis on:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    1
Core(s) per socket:    20
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E7-8870 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               2094.865
BogoMIPS:              4189.73
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              51200K
NUMA node0 CPU(s):     0-9,40-49
NUMA node1 CPU(s):     10-19,50-59
NUMA node2 CPU(s):     20-29,60-69
NUMA node3 CPU(s):     30-39,70-79

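As a side note (not something I have benchmarked myself): if you would rather keep the original csv, a multithreaded parser such as pyarrow may also cut the load time considerably. A rough sketch, assuming pyarrow is installed and using the file name from your snippet:

import pandas as pd
from pyarrow import csv  # pyarrow ships a multithreaded csv reader

# parse the csv in parallel with pyarrow, then convert to a pandas dataframe
table = csv.read_csv("exprMatrix.csv")
data = table.to_pandas()
data = data.set_index(data.columns[0])  # assumes the first column holds the gene ids, as with index_col=0
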
Please let me know if this approach helps, and thanks again for highlighting this issue. We will consider prioritizing support for alternative input formats such as AnnData objects; a rough sketch of what that could look like is below.
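
This is hypothetical: it assumes the matrix is available as an .h5ad file (here called organoid.h5ad) with cells as rows, which is the usual AnnData orientation, and a cell-type column named "Type" in .obs:

import scanpy as sc
import cellex as cx

# hypothetical .h5ad input; AnnData stores cells x genes
ad = sc.read_h5ad("organoid.h5ad")

# CELLEX expects a genes x cells dataframe, so transpose
data = ad.to_df().T

# per-cell annotation, e.g. a hypothetical cell-type column in ad.obs
metadata = ad.obs["Type"]

eso = cx.ESObject(data=data, annotation=metadata, verbose=True)
eso.compute(verbose=True)
eso.results["esmu"].to_csv("organoid.esmu.csv.gz")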

DaianeH commented 4 years ago

Yes, this is much better :) Thank you!