whelena commented 11 months ago

Description

Added a QC plot made by @WuSelina for visualizing clone distribution across the genome

Closes #88

Pipeline Run Results

Case 1
- sample: MSK-AB-0014
- input tsv: /hot/software/package/public-R-CancerEvolutionVisualization/development/test_input/multi-sample.tsv
- output: /hot/software/package/public-R-CancerEvolutionVisualization/development/hwinata-add-genome-distribution-plot/no-defaults

Checklist

[x] This PR does NOT contain Protected Health Information (PHI). A repo may need to be deleted if such data is uploaded.
Disclosing PHI is a major problem[^1] - Even a small leak can be costly[^2].
[x] This PR does NOT contain germline genetic data[^3], RNA-Seq, DNA methylation, microbiome or other molecular data[^4].

[^1]: UCLA Health reaches $7.5m settlement over 2015 breach of 4.5m patient records
[^2]: The average healthcare data breach costs $2.2 million, despite the majority of breaches releasing fewer than 500 records.
[^3]: Genetic information is considered PHI. Forensic assays can identify patients with as few as 21 SNPs
[^4]: RNA-Seq, DNA methylation, microbiome, or other molecular data can be used to predict genotypes (PHI) and reveal a patient's identity.

[x] This PR does NOT contain other non-plain text files, such as: compressed files, images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other output files.

To automatically exclude such files using a .gitignore file, see here for example.

[x] I have read the code review guidelines and the code review best practice on GitHub check-list.
[x] I have set up or verified the main branch protection rule following the github standards before opening this pull request.
[x] The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
[x] I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

whelena commented 11 months ago

For chromosome information, I'm currently just reading from a .tsv file I put in data/chr.info. I don't know if this is the best approach, so any comments are much appreciated.

WuSelina commented 11 months ago

> For chromosome information, I'm currently just reading from a .tsv file I put in data/chr.info. I don't know if this is the best approach, so any comments are much appreciated.

I think the way you have it is good. I am also not sure of what the recommended standard is, though.

I previously used get.chr.length() from the bedr package, which gives lengths for GRCh38 if specified. However, it does not return 'GC_count' or 'GC_percent', and it returns the chromosomes out of order (19 and 20 are swapped), so I have been reordering the resulting data frame just in case:

```r
library(bedr);

# Get chr lengths info
chr.len <- get.chr.length(build = 'hg38');

# Keep only chrs 1-22 and sex chrs and remove 'chr' prefix
chr.len$chr <- gsub('chr', '', chr.len$chr);
chr.len <- subset(chr.len, subset = chr %in% c(1:22, 'X', 'Y'));

# Reorder the chr info
chrom.order <- c(as.character(1:22), 'X', 'Y');

# Convert the 'chr' column to a factor with custom levels
chr.len$chr <- factor(chr.len$chr, levels = chrom.order);

# Sort the dataframe based on the order of the 'chr' column
chr.len <- chr.len[order(chr.len$chr), ];
```
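
The reordering steps above can be tried without bedr on a small stand-in table. This is only a sketch: `reorder.chr` is a hypothetical helper name, and the lengths are placeholders, not real GRCh38 values.

```r
# Toy stand-in for the table returned by get.chr.length().
# Lengths are placeholders, not real GRCh38 values; chr19 and chr20 are
# deliberately swapped, and chrM is included to show the filtering step.
chr.len <- data.frame(
    chr = c('chr1', 'chr2', 'chr20', 'chr19', 'chrX', 'chrM'),
    length = c(600, 500, 300, 400, 200, 100),
    stringsAsFactors = FALSE
    );

# Hypothetical helper: strip the 'chr' prefix, keep only the requested
# chromosomes, and sort rows by the given chromosome order.
reorder.chr <- function(chr.info, chrom.order = c(as.character(1:22), 'X', 'Y')) {
    chr.info$chr <- sub('^chr', '', chr.info$chr);
    chr.info <- chr.info[chr.info$chr %in% chrom.order, ];
    chr.info$chr <- factor(chr.info$chr, levels = chrom.order);
    chr.info <- chr.info[order(chr.info$chr), ];
    rownames(chr.info) <- NULL;
    return(chr.info);
    }

chr.len.ordered <- reorder.chr(chr.len);
print(chr.len.ordered);
```

Sorting on a factor with explicit levels avoids relying on the order in which the source returns chromosomes, so the same helper would also work on a table read from a .tsv file.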

whelena commented 4 months ago

@WuSelina Could you double-check the density calculation used to get the counts? The density.df$scaled.y code was not doing what I think it should be doing, and it was giving me really small numbers. The function is in create.clone.genome.distribution.densityplot.R. Thanks!
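
For context on the small numbers, here is a general sketch of how a kernel density relates to counts. It is not the code in create.clone.genome.distribution.densityplot.R, and the positions and `bin.width` are made up for illustration. Because `stats::density()` returns a curve that integrates to about 1, its y values are tiny on a genomic scale. One possible way to get count-like values is to multiply by the number of points and by a bin width.

```r
set.seed(1);

# Hypothetical SNV positions on a 1 Mbp region (not data from this PR)
positions <- runif(500, min = 0, max = 1e6);

# Kernel density estimate; y is a probability density per base pair
dens <- density(positions);
density.df <- data.frame(x = dens$x, y = dens$y);

# The curve integrates to roughly 1, so individual y values are very small
step <- diff(dens$x)[1];
area <- sum(density.df$y) * step;

# Assumed scaling: expected number of points per bin of width bin.width
bin.width <- 1e5;
density.df$count <- density.df$y * length(positions) * bin.width;

print(max(density.df$y));
print(area);
print(max(density.df$count));
```

With this scaling, integrating `count / bin.width` over x recovers approximately the total number of points, which is a quick way to check that the scaled values are on the intended scale.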

whelena commented 4 months ago

R-CMD-check is failing due to missing documentation, which I will be fixing in a separate PR.

uclahs-cds / package-CancerEvolutionVisualization

Hwinata add genome distribution plot #97
