morris-lab / Capybara

Capybara: A computational tool to measure cell identity and fate transitions

Sample and Reference Single-Cell Dataset Preprocessing #4

Closed b-russell5 closed 3 years ago

b-russell5 commented 3 years ago

Hi Capybara team,

Are raw count data (for the sample single-cell dataset and reference single-cell dataset) used as input to the Capybara pipeline? Is there any need for normalization or other pre-processing (i.e., QC filtering, scaling, dimensionality reduction) before using these datasets with Capybara? If so, where in the Capybara workflow should this occur? I see that the tutorial reads "Selection of your preferred normalization method can be applied to raw counts. Here, we will demonstrate the usage of the bulk raw counts in the pipeline." I am confused about which datasets should be normalized (query and single-cell reference?) and which datasets should be kept as raw counts (bulk reference?) for input into the Capybara pipeline.

Thank you for the package and your support.

KaetheKong commented 3 years ago

Hello!

Sorry for the confusion in the document! The bulk reference and single-cell reference work a bit differently from each other.

For the bulk part, the RPKM (normalized) reference is preferred for comparison. It will be log-transformed and scaled to match the test single-cell data during the Capybara process, before the first step.

For the single-cell part, raw counts are preferred for both the reference single-cell data and the test single-cell data. Capybara has an integrated normalization process that runs before the first step, in which both the test and reference data are normalized and brought to a similar scale. The data can also be normalized or preprocessed on other platforms. When inputting a normalized or preprocessed dataset into the pipeline, the two parameters bulk.norm (for the reference) and norm.sc (for the test single-cell data) of the function single.round.QP.analysis can be set to FALSE. These two parameters indicate whether the integrated normalization should be performed: if TRUE, normalization will be performed.
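As a rough illustration of the rule above, the Python sketch below guesses whether a matrix still holds raw counts and suggests matching flag values. It is not part of Capybara (which is an R package): the helper names looks_like_raw_counts and suggest_norm_flags and the toy matrices are hypothetical, and the whole-number check is only a heuristic that assumes normalized data contain fractional values.

```python
from typing import Dict, Sequence

Matrix = Sequence[Sequence[float]]


def looks_like_raw_counts(matrix: Matrix) -> bool:
    """Heuristic: raw counts are non-negative whole numbers.

    Assumption: normalized or log-transformed data contain fractional
    (or negative) values. An empty matrix is treated as raw counts.
    """
    return all(
        value >= 0 and float(value).is_integer()
        for row in matrix
        for value in row
    )


def suggest_norm_flags(reference: Matrix, test: Matrix) -> Dict[str, bool]:
    """Suggest bulk.norm (reference) and norm.sc (test) values.

    TRUE  -> let the integrated normalization run (input looks raw).
    FALSE -> skip it (input looks already normalized).
    """
    return {
        "bulk.norm": looks_like_raw_counts(reference),
        "norm.sc": looks_like_raw_counts(test),
    }


# Toy data (hypothetical): a raw-count reference and a log-normalized test set.
raw_reference = [[0, 3, 12], [5, 0, 1]]
lognorm_test = [[0.0, 1.39, 2.56], [1.79, 0.0, 0.69]]

flags = suggest_norm_flags(raw_reference, lognorm_test)
print(flags)  # {'bulk.norm': True, 'norm.sc': False}
```

The suggested values would then be passed as bulk.norm and norm.sc when calling single.round.QP.analysis in R. If a dataset was normalized in a way that still yields whole numbers, this check will misclassify it, so the flags should be confirmed by hand.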

Hope this helps clarify this part of the pipeline! Best, Wenjun

b-russell5 commented 3 years ago

Hi Wenjun, Thank you for the clarification - that makes sense! I understand now. Thanks again!

KaetheKong commented 3 years ago

Glad it makes sense! Thanks again for using Capybara!