Generation of adata.h5ad by using motif specific command line and Cell Ranger output files

pinellolab / STREAM

STREAM: Single-cell Trajectories Reconstruction, Exploration And Mapping of single-cell data

http://stream.pinellolab.org

GNU Affero General Public License v3.0


Generation of adata.h5ad by using motif specific command line and Cell Ranger output files #101

Closed sylestiel closed 3 years ago

sylestiel commented 3 years ago

Hi,

I have tried several times to generate the adata.h5ad, zscores_scaled.tsv.gz, and zscores.tsv.gz files from my Cell Ranger output data using the following command line:

$ stream_atac -c ./filtered_peak_bc_matrix/matrix.mtx -r ./filtered_peak_bc_matrix/peaks.bed -s ./filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

Although it worked with no problem for a dataset of ~3K cells, it appears to run endlessly for a dataset with >7K cells.

I let it run for a couple of weeks and then stopped it. Can you suggest a way to speed up the generation of the adata and z-score files?

Thank you!

huidongchen commented 3 years ago

Hi,

Sorry about the error. Instead of running stream_atac, can you try to run the R script run_preprocess.R directly in the same environment?

Rscript ./run_preprocess.R -c ./filtered_peak_bc_matrix/matrix.mtx -r ./filtered_peak_bc_matrix/peaks.bed -s ./filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

It might be something related to the package rpy2.

huidongchen commented 3 years ago

Also, to speed up the whole procedure, you can try increasing n_jobs. STREAM internally calls chromVAR to compute the z-score matrix. Just for your reference, in our previous benchmark study, with ~5k cells and 44 CPUs, this part took ~30 mins.
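
To make the n_jobs suggestion concrete, here is a minimal Python sketch that builds the same Rscript command with --n_jobs derived from the number of CPUs on the machine. The helper name build_preprocess_cmd and the choice to leave one core free are illustrative assumptions, not part of STREAM; the flags are only those already used in the commands in this thread.

```python
import os
import shlex


def build_preprocess_cmd(script, matrix_dir, genome="mm10",
                         n_jobs=None, out_dir="stream_output"):
    """Build the run_preprocess.R command line (hypothetical helper)."""
    if n_jobs is None:
        # Assumption: use all available cores but one.
        n_jobs = max(1, (os.cpu_count() or 1) - 1)
    return [
        "Rscript", script,
        "-c", os.path.join(matrix_dir, "matrix.mtx"),
        "-r", os.path.join(matrix_dir, "peaks.bed"),
        "-s", os.path.join(matrix_dir, "barcodes.tsv"),
        "--file_format", "mtx",
        "-g", genome,
        "-f", "motif",
        "--n_jobs", str(n_jobs),
        "-o", out_dir,
    ]


cmd = build_preprocess_cmd("./run_preprocess.R", "./filtered_peak_bc_matrix")
# Print a shell-ready command that can be pasted into the terminal.
print(shlex.join(cmd))
```

Whether more jobs actually help depends on available memory as well as cores, so it may be worth trying a moderate value first.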

sylestiel commented 3 years ago

Thank you! I will give it a try.

sylestiel commented 3 years ago

You wrote: "can you try to run the R script run_preprocess.R directly in the same environment?"

Is this in Terminal or in a Jupyter Notebook? I need more clarification.

huidongchen commented 3 years ago

The following command needs to be run in your terminal:

Rscript ./run_preprocess.R -c ./filtered_peak_bc_matrix/matrix.mtx -r ./filtered_peak_bc_matrix/peaks.bed -s ./filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

You can simply replace your stream_atac command line with the Rscript command line.

sylestiel commented 3 years ago

So do I start off with the following?

conda activate myenv
R
Rscript ./run_preprocess.R -c ./filtered_peak_bc_matrix/matrix.mtx -r ./filtered_peak_bc_matrix/peaks.bed -s ./filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

huidongchen commented 3 years ago

First, you need to download the script run_preprocess.R to your local machine (e.g. under the directory ~/your_workdir).

Then in your terminal,

$ conda activate myenv
$ Rscript ~/your_workdir/run_preprocess.R -c ./filtered_peak_bc_matrix/matrix.mtx -r ./filtered_peak_bc_matrix/peaks.bed -s ./filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

You are all set!

sylestiel commented 3 years ago

$ Rscript /Volumes/BKUP2/R_projects/Stream/run_preprocess.R -c /Volumes/BKUP2/scATAC_data/72020_scATAC/SH2_E165/outs/filtered_peak_bc_matrix/matrix.mtx -r /Volumes/BKUP2/scATAC_data/72020_scATAC/SH2_E165/outs/filtered_peak_bc_matrix/peaks.bed -s /Volumes/BKUP2/scATAC_data/72020_scATAC/SH2_E165/outs/filtered_peak_bc_matrix/barcodes.tsv --file_format mtx -g mm10 -f motif --n_jobs 3 -o stream_output

Error: unexpected '<' in "<"

Execution halted

Can you catch the error here?

huidongchen commented 3 years ago

Sorry, I am not sure about this error. I tried it on an example dataset and it works well on my machine.

sylestiel commented 3 years ago

It is working. I downloaded the wrong file previously.
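
The thread does not say which file was downloaded by mistake. One common way to hit "Error: unexpected '<'" (an assumption here, not confirmed above) is saving a web page instead of the raw script, so that the file begins with HTML markup rather than R code. A quick check along those lines, with the hypothetical helper name looks_like_html:

```python
from pathlib import Path


def looks_like_html(path):
    """Return True if the first non-blank character of the file is '<'.

    An R script would not normally start with '<', whereas a saved web
    page starts with something like '<!DOCTYPE html>'.
    """
    head = Path(path).read_text(encoding="utf-8", errors="replace")[:512]
    return head.lstrip().startswith("<")


# Example usage (the path is a placeholder for wherever the script was saved):
# looks_like_html("run_preprocess.R")
```

If the check returns True, downloading the raw version of the script again is a reasonable next step.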

huidongchen commented 3 years ago

Awesome. Once it's finished, you can run the following code snippet to read the result into a STREAM-compatible object:

import pandas as pd
import anndata as ad
from sklearn import preprocessing
import stream as st

# Load the z-score matrix (rows: motifs, columns: cells).
df_zscores = pd.read_csv('zscores.tsv.gz', sep='\t', index_col=0)
# Standardize each motif (row) across cells.
df_zscores_scaled = preprocessing.scale(df_zscores, axis=1)
df_zscores_scaled = pd.DataFrame(df_zscores_scaled, index=df_zscores.index, columns=df_zscores.columns)
# Build a cell-by-motif AnnData object and set the STREAM working directory.
adata = ad.AnnData(X=df_zscores_scaled.values.T, obs={'obs_names': df_zscores_scaled.columns}, var={'var_names': df_zscores_scaled.index})
st.set_workdir(adata, './stream_result')
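
For anyone unsure what the scaling and transpose steps in the snippet do, here is a toy sketch using NumPy only. It assumes, as the snippet implies, that the rows of zscores.tsv.gz are motifs and the columns are cells; the numbers are invented. Row-wise standardization with the population standard deviation mirrors what preprocessing.scale(..., axis=1) computes for rows that are not constant, and the transpose gives the cells-by-features layout that AnnData uses.

```python
import numpy as np

# Toy z-score matrix: 2 motifs (rows) x 3 cells (columns); values are invented.
zscores = np.array([[1.0, 2.0, 3.0],
                    [10.0, 10.0, 40.0]])

# Standardize each motif across cells: subtract the row mean and divide by
# the row's population standard deviation (ddof=0, NumPy's default).
row_mean = zscores.mean(axis=1, keepdims=True)
row_std = zscores.std(axis=1, keepdims=True)
scaled = (zscores - row_mean) / row_std

# AnnData stores observations (cells) as rows, so transpose to cells x motifs.
X = scaled.T
print(X.shape)  # (3, 2)
```

After this step every motif column of X has mean 0 and unit variance across cells, which puts motifs with very different z-score ranges on a common scale.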

sylestiel commented 3 years ago

huidongchen commented 3 years ago

As I showed above, you need to import several other libraries:

import pandas as pd
import anndata as ad
from sklearn import preprocessing
import stream as st

sylestiel commented 3 years ago

Hi,

A new error:

No adata file in the stream_results folder!!! Suggestions?

huidongchen commented 3 years ago

Instead of st.read(), please use the code snippet I posted above.

sylestiel commented 3 years ago

It appears to be running but I don't see any file that is called adata within the stream_result folder. Is that to be expected?

huidongchen commented 3 years ago

You don't need to run step 4 and step 7 in your notebook.

You can skip to step 8 in this tutorial.

sylestiel commented 3 years ago

Thank you very much Huidong!