cofactor for arcsinh transformation when a sce object is assembled

sunhuaiyu commented 4 years ago

Dear Dr. Robinson:

I am writing with a question that came up while using cytofWorkflow to analyze flow cytometry data.

When trying out different cofactor values for prepData(), I was not satisfied with the histograms of some markers (single peak, too narrow). I wonder if I can use different cofactors for individual markers. Do you think this would be a 'safe' approach for the data transformation in cytofWorkflow?

Thank you for your attention. Best regards,

Huaiyu Sun

markrobinsonuzh commented 4 years ago

@sunhuaiyu Thanks for the message.

Yes, you can use different cofactors for each marker; in fact, you can pass a vector of cofactors to prepData(). The docs say:

`cofactor` numeric cofactor(s) to use for optional arcsinh-transformation when `transform = TRUE`; single value or a vector with channels as names.
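
To make the "vector with channels as names" part concrete, here is a minimal base-R sketch of what a per-channel arcsinh transformation computes, namely asinh(x / cofactor) with a separate cofactor for each column. The channel names, intensities, and cofactor values are made up for illustration; in a real analysis the names would have to match the channel names in your data, and the commented prepData() call at the end is an assumption to check against ?CATALYST::prepData.

```r
# Hypothetical channel names and raw intensities, for illustration only.
set.seed(1)
raw <- cbind(
  CD3  = rexp(5, rate = 1 / 200),
  CD19 = rexp(5, rate = 1 / 2000)
)

# Named cofactor vector: one value per channel, names matching the channels.
cofactor <- c(CD3 = 150, CD19 = 1500)

# Align the cofactors to the column order, then apply
# asinh(x / cofactor) column by column.
cf <- cofactor[colnames(raw)]
transformed <- asinh(sweep(raw, 2, cf, "/"))

print(round(transformed, 3))

# With CATALYST, the same named vector would be handed to prepData()
# (fs, panel and md are placeholders for your own inputs):
# sce <- prepData(fs, panel, md, cofactor = cofactor)
```

A larger cofactor compresses a channel more strongly around zero, so choosing it per marker is a way to adjust each histogram separately.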

Cheers, Mark

sunhuaiyu commented 4 years ago

Thank you!

sunhuaiyu commented 4 years ago

Hi Mark,

Thanks again for responding to my question regarding cofactors.

I am trying to process a large dataset of over 200 samples and more than 25 million cells. However, the clustering and dimensionality reduction functions fail and crash the R session (running on a single EC2 instance with 128 GB of memory). Do you have any suggestions or recommendations for using the CyTOF workflow on data of this size?

Best regards,

Huaiyu

markrobinsonuzh commented 4 years ago

This is a good question. It is a little more difficult with CyTOF than with, for example, scRNA-seq, because CyTOF data do not have the natural zeros that allow sparse matrices to be used. On the flip side, the dimension of CyTOF data should be much smaller.

A couple of thoughts:

- 25M cells over 200 samples is 125k cells per sample (on average), which is a lot! Do you need this level of coverage? Of course, if you are interested in rare populations, then you will, but you might consider downsampling, either in a simple way (e.g., 20000 cells per sample) or at least by setting a cap on the number of cells. Alternatively (and more work), you could downsample only the large populations; in the immune component, for example, you probably don't need to look at all T cells or B cells.
- Make sure to remove unnecessary columns in the data (e.g., time, DNA channels, etc.) so that you keep only the data you really need. The same goes for the colData of the object.
- Beyond this, there are disk-based options, such as DelayedArray and so on. I haven't needed to use these much, but they could be an option.
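
The "cap on the number of cells" idea can be sketched in base R as follows. This is only an illustration under assumptions: the sample labels are simulated, the cap of 20 is arbitrary, and the final comment assumes the per-cell sample labels live in sce$sample_id, which is worth checking on your own object.

```r
# Hypothetical per-cell sample labels; in an SCE built by prepData()
# these would presumably come from sce$sample_id.
set.seed(42)
sample_id <- rep(c("s1", "s2", "s3"), times = c(50, 8, 120))
cap <- 20  # maximum number of cells to keep per sample

# Split the cell indices by sample and draw at most `cap` from each;
# samples with fewer cells than the cap are kept in full.
idx_by_sample <- split(seq_along(sample_id), sample_id)
keep <- unlist(lapply(idx_by_sample, function(i) {
  if (length(i) > cap) i[sample.int(length(i), cap)] else i
}), use.names = FALSE)
keep <- sort(keep)

# Number of retained cells per sample.
print(table(sample_id[keep]))

# The same index vector could then subset the cells (columns) of an SCE:
# sce_small <- sce[, keep]
```

Setting a seed before sampling keeps the subset reproducible, which matters if clustering results are compared across runs.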

sunhuaiyu commented 4 years ago

Thank you! I will definitely try downsampling first. I really appreciate the detailed reply.

Huaiyu

sunhuaiyu commented 4 years ago

Hi Mark,

Is there any option to parallelize the t-SNE computation in the CyTOF workflow?

Also, as a separate question, is FlowSOM100 the only clustering scheme available in your workflow?

Thank you, Huaiyu

HelenaLC commented 4 years ago

Re parallelisation of t-SNE (and other dimensionality reduction methods): CATALYST's runDR calls scater's runX functions (e.g., runTSNE for t-SNE). Thus, any parameters accepted by these functions can be supplied via ... in runDR and will be passed on to the corresponding runX function. For t-SNE, parallelisation can be achieved via the argument BPPARAM (see ?scater::runTSNE for details).
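
As a rough sketch of how such a call might look, the example below uses the small PBMC example data shipped with CATALYST. Several things here are assumptions rather than confirmed behaviour: the number of workers and cells are arbitrary, and, as far as I understand the scater documentation, BPPARAM only parallelises the nearest-neighbour search and only takes effect together with external_neighbors = TRUE. Please verify both against ?scater::runTSNE for your installed version.

```r
library(CATALYST)
library(BiocParallel)

# Small example dataset that ships with CATALYST.
data(PBMC_fs, PBMC_panel, PBMC_md)
sce <- prepData(PBMC_fs, PBMC_panel, PBMC_md)

# Extra arguments fall through `...` to scater::runTSNE.
# Assumption: BPPARAM is only used for the neighbour search,
# and only when external_neighbors = TRUE.
set.seed(1)
sce <- runDR(sce, dr = "TSNE", cells = 200, features = "type",
             external_neighbors = TRUE,
             BPPARAM = MulticoreParam(workers = 2))

# The embedding is stored as a reduced dimension named "TSNE".
print(dim(SingleCellExperiment::reducedDim(sce, "TSNE")))
```

MulticoreParam relies on forking, so on Windows a SnowParam would be the usual substitute.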

Re other clustering algorithms: While cluster wraps around FlowSOM only, any clustering method can in principle be applied and incorporated into CATALYST's infrastructure. Without wanting to go into details here, please check section 8.2 Using other clustering algorithms in the vignette here: http://bioconductor.org/packages/release/bioc/vignettes/CATALYST/inst/doc/differential.html

sunhuaiyu commented 4 years ago

Thanks!

markrobinsonuzh / cytofWorkflow

cofactor for arcsinh transformation when a sce object is assembled #18
