morris-lab / Capybara

Capybara: A computational tool to measure cell identity and fate transitions

Creating Bulk Reference from Human Expression Matrices #3

Closed b-russell5 closed 3 years ago

b-russell5 commented 3 years ago

Hi Capybara Development Team,

Thank you for this exciting package! I am interested in using the Capybara workflow with a human dataset and have questions regarding the development of the bulk reference. I see in the Methods section of your manuscript that you filtered the available datasets mined from ARCHS4 to retain only poly-A and total RNA-seq data, and then calculated Pearson’s correlations on every sample pair from the same tissue. Can you provide more detail on this pre-processing workflow? Were the Pearson's correlations calculated between the gene expression matrices of each GEO sample pair? Any code or additional guidance would be greatly appreciated in order to replicate the workflow to create a bulk human reference.

KaetheKong commented 3 years ago

Hello!

Thanks for using Capybara!

For the preprocessing of the bulk datasets, we first selected poly-A and total RNA-seq data and filtered on the organism description so that only WT/healthy mice are considered for the reference. Then, for each tissue, we calculated pairwise Pearson's correlations on the gene expression matrices of all GEO samples and selected the most highly correlated samples as the ones most similar and representative across all samples of that tissue. I am putting together a simple script for this right now and I'll upload it as soon as I finish!
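A minimal sketch of the selection step described above, not the script from the repository. It runs on a small simulated matrix. The sample names, the cutoff `n_keep`, and ranking samples by their mean correlation to the others are all assumptions made for illustration.

```r
set.seed(1)

# Simulated log-expression matrix for one tissue: genes in rows, samples in columns.
n_genes <- 200
signal <- rnorm(n_genes)                       # shared tissue profile
expr <- sapply(1:6, function(i) signal + rnorm(n_genes, sd = 0.3))
expr[, 6] <- rnorm(n_genes)                    # one outlier sample unrelated to the tissue
colnames(expr) <- paste0("GSM", 1:6)           # made-up sample IDs

# Pairwise Pearson correlations between samples (columns).
cor_mat <- cor(expr, method = "pearson")

# Rank each sample by its mean correlation to the other samples.
diag(cor_mat) <- NA
mean_cor <- rowMeans(cor_mat, na.rm = TRUE)

# Keep the n_keep most correlated samples (n_keep is an arbitrary choice here).
n_keep <- 4
selected <- names(sort(mean_cor, decreasing = TRUE))[seq_len(n_keep)]
print(selected)
```

With real data, `expr` would hold the ARCHS4 expression values for the samples of one tissue, and the cutoff would need to be chosen per tissue.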

Best, Wenjun

b-russell5 commented 3 years ago

Thank you for your response and further explanation. I look forward to seeing the script! Thank you!

KaetheKong commented 3 years ago

Hello!

I have just uploaded two files, Step 1 - ARCHS4 Mining.R and Step 2 - bulk cleaning.R, under the examples folder.

The Step 1 file is mainly for mining ARCHS4. It includes our filtering for total RNA and poly-A RNA as well as the mouse strain filtering. We also limited the selection of tissues. The tissue list can be chosen by looking at the source names of the samples, using the following:

library(rhdf5)  # provides h5read()
samples = h5read("mouse_matrix_download.h5", "meta")
unique(tolower(samples$Sample_source_name_ch1))

The list can be quite long and hard to sort through, as the source name can include detailed treatment information for the samples. This website from the ARCHS4 team has a broader list of tissues/samples in its visualization tool, which might be helpful for selecting the tissue list.
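As a rough illustration of narrowing free-text source names down to a tissue list, the sketch below matches a few keywords against a toy vector. The example source names, the `tissues` keyword list, and the helper `match_tissue` are made up for this example. With real data, the input would be the `Sample_source_name_ch1` field read above.

```r
# Toy stand-in for tolower(samples$Sample_source_name_ch1); values are invented.
source_names <- tolower(c("Liver", "liver, high-fat diet", "Whole brain",
                          "heart left ventricle", "HEK293 cells", "brain cortex"))

# Hypothetical list of tissues of interest.
tissues <- c("liver", "brain", "heart")

# Return the first tissue keyword found in a source name, or NA if none matches.
match_tissue <- function(x, keywords) {
  hit <- keywords[vapply(keywords, grepl, logical(1), x = x, fixed = TRUE)]
  if (length(hit) > 0) hit[1] else NA_character_
}
tissue_label <- vapply(source_names, match_tissue, character(1),
                       keywords = tissues, USE.NAMES = FALSE)

# Number of samples per matched tissue; unmatched samples are counted under NA.
print(table(tissue_label, useNA = "ifany"))
```

Plain substring matching like this will also catch treated or diseased samples whose source name mentions the tissue, so the matches still need to be reviewed against the other metadata filters.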

The Step 2 file cleans the bulk data identified in Step 1. It checks the correlation across samples and tissues. The goal is for samples from the same tissue to correlate more highly with each other than with samples from different tissues. Ultimately, the most highly correlated samples are used to compute an averaged bulk profile.
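The idea behind this check can be sketched on simulated data as follows. This is only an illustration under assumed names (`expr`, `tissue`, `bulk_ref`) and toy values; the uploaded Step 2 script may do this differently.

```r
set.seed(2)

# Simulated log-expression: 300 genes, two tissues with four samples each.
n_genes <- 300
profile <- list(liver = rnorm(n_genes), brain = rnorm(n_genes))
tissue <- rep(names(profile), each = 4)
expr <- sapply(tissue, function(tis) profile[[tis]] + rnorm(n_genes, sd = 0.4))
colnames(expr) <- paste0(tissue, "_", seq_along(tissue))

# Sample-by-sample Pearson correlation across all tissues.
cor_mat <- cor(expr)

# Compare correlations within the same tissue against those between tissues.
same_tissue <- outer(tissue, tissue, "==")
off_diag <- upper.tri(cor_mat)                 # count each pair once, skip the diagonal
within_cor <- mean(cor_mat[off_diag & same_tissue])
between_cor <- mean(cor_mat[off_diag & !same_tissue])
cat("within:", round(within_cor, 2), " between:", round(between_cor, 2), "\n")

# Average the samples of each tissue into one bulk reference column.
bulk_ref <- sapply(unique(tissue),
                   function(tis) rowMeans(expr[, tissue == tis, drop = FALSE]))
print(dim(bulk_ref))
```

If the within-tissue correlation is not clearly higher than the between-tissue one for some tissue, that suggests mislabeled or heterogeneous samples that should be dropped before averaging.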

Hope this helps! Best, Wenjun

b-russell5 commented 3 years ago

Hi Wenjun,

Thank you so much for providing the example scripts and additional guidance! They are incredibly helpful and give me a better starting point for creating a human bulk reference. Thanks again!

KaetheKong commented 3 years ago

Happy to help anytime!

Danyi-ZHENG commented 2 years ago

> Hello!
>
> Thanks for using Capybara!
>
> For the preprocessing of the bulk datasets, we first selected poly-A and total RNA-seq data and filtered on the organism description so that only WT/healthy mice are considered for the reference. Then, for each tissue, we calculated pairwise Pearson's correlations on the gene expression matrices of all GEO samples and selected the most highly correlated samples as the ones most similar and representative across all samples of that tissue. I am putting together a simple script for this right now and I'll upload it as soon as I finish!
>
> Best, Wenjun

Looking forward to your guidance on human data! I am also using it for human dataset analysis. Thank you!

xchromosome219 commented 2 years ago

Hi Wenjun,

I plan to use your pipeline for human data. I tried to download your example scripts and input files, but the download failed. Would you mind providing the input again?

Many thanks for considering my request.

Haitao Wang