Tasks to address when adding option to use “full” TCGA reference cohort

umccr / RNAsum

Pipeline for generating RNAseq-based cancer patient reports

The following matters need to be addressed when adding the option to use the “full” TCGA patient reference cohort:

Use static plots instead of interactive ones (in particular for plots with per-sample data points in the "Input data summary" and "Expression profiles" sections) to reduce both the run time and the size of the final report
Switch off saving the expression data (expression matrices) and summary tables, since they are computationally intensive to produce, result in big files, and are used only by the RNA data portal
Review the run times in the Addendum to check which time-consuming code chunks can be skipped to reduce the run time
Create a separate "RNAsum.data" repo for the expression matrix files, including the “full” TCGA patient reference cohort

I ran RNAsum using the “full” and "partial" TCGA patient reference cohort options for the following samples:

SBJ04426 BRCA
SBJ04187 BRCA
SBJ04296 BRCA
SBJ01649 PANCAN
SBJ04469 PANCAN
SBJ02061 PANCAN
SBJ02091 PANCAN
SBJ04376 PANCAN
SBJ04408 PANCAN

Attached are summary plots illustrating the following:

RNAsum processing time by sample
RNAsum processing time by chunk
RNAsum report size by sample

Based on the "RNAsum processing time by chunk" chart, the following R code chunks are the most computationally demanding (the note in parentheses indicates whether the respective chunk can be skipped when using the "full" TCGA reference option):

(can be skipped) data_transformation_plot
(keep) glance_expr_plot_immune_genes
(keep) pca
(keep) glance_expr_plot_cancer_genes
(can be skipped) data_transformation_display
(keep) glance_expr_plot_hrd_genes
(keep) top_hits_fusions
(keep) unnamed-chunk-1
(keep) rle

I'd also skip the "data_normalisation_plot", "scree_combined_data_display" and "rle_display" chunks, since their output is not readable given the number of included samples.
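
The skip/keep decisions above can be written down as a small lookup. This is only a language-neutral sketch in Python, not RNAsum code: the chunk names come from this issue, while `HEAVY_CHUNKS`, `UNREADABLE_CHUNKS`, `chunks_to_skip` and `should_evaluate` are hypothetical names invented for illustration.

```python
# Chunk names are taken from this issue; all other identifiers are hypothetical.

# Most time-consuming chunks and whether they can be skipped with the "full" cohort.
HEAVY_CHUNKS = {
    "data_transformation_plot": "skip",
    "glance_expr_plot_immune_genes": "keep",
    "pca": "keep",
    "glance_expr_plot_cancer_genes": "keep",
    "data_transformation_display": "skip",
    "glance_expr_plot_hrd_genes": "keep",
    "top_hits_fusions": "keep",
    "unnamed-chunk-1": "keep",
    "rle": "keep",
}

# Chunks whose output is not readable with this many samples.
UNREADABLE_CHUNKS = [
    "data_normalisation_plot",
    "scree_combined_data_display",
    "rle_display",
]


def chunks_to_skip(cohort: str) -> list:
    """Return the chunk names to skip for the given reference cohort option."""
    if cohort not in ("full", "partial"):
        raise ValueError(f"unknown cohort option: {cohort!r}")
    if cohort == "partial":
        # The "partial" cohort keeps every chunk.
        return []
    skipped = [name for name, decision in HEAVY_CHUNKS.items() if decision == "skip"]
    return skipped + UNREADABLE_CHUNKS


def should_evaluate(chunk: str, cohort: str) -> bool:
    """True if the chunk should still be run for the given cohort option."""
    return chunk not in chunks_to_skip(cohort)


if __name__ == "__main__":
    for name in chunks_to_skip("full"):
        print("skip:", name)
```

In the R Markdown report, a check like `should_evaluate` could plausibly be wired to knitr's `eval` chunk option, but how the cohort option is exposed inside the report is an assumption here.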

Tasks to address when adding option to use “full” TCGA reference cohort #164