lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
127 stars 32 forks source link

NOISY SEGMENTATION in most samples #317

Closed WangKang-Leo closed 1 year ago

WangKang-Leo commented 1 year ago

Describe the issue I followed your suggestion, run the PureCN from scratch. The distribution of purity becomes better, but it is commented on by Noisy Segmentation in most of the samples. Could you please help to look into one sample. To Reproduce

Step1 reference files

Rscript $PURECN/IntervalFile.R --in-file /proj/sens2019581/nobackup/PureCN/reference_files/Twist_Exome_RefSeq_targets_hg38_100bp_padding.bed \ --fasta /sw/data/uppnex/ToolBox/hg38bundle/Homo_sapiens_assembly38.fasta \ --genome hg38 \ --mappability /proj/sens2019581/nobackup/PureCN/reference_files/GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw \ --out-file $OUT_REF/Twist_Exome_RefSeq_targets_hg38_intervals.txt

Step2 Coverage

Rscript $PURECN/Coverage.R --out-dir $OUT \ --bam /proj/sens2019581/nobackup/PureCN/data/normal_bam.list \ --intervals $OUT_REF/Twist_Exome_RefSeq_targets_hg38_intervals.txt \ --cores 64 --parallel Rscript $PURECN/Coverage.R --out-dir $OUT \ --bam /proj/sens2019581/nobackup/PureCN/data/tumor_bam.list \ --intervals $OUT_REF/Twist_Exome_RefSeq_targets_hg38_intervals.txt \ --cores 64 --parallel

NormalDB

Rscript $PURECN/NormalDB.R --out-dir $OUT_REF \ --coverage-files /proj/sens2019581/nobackup/PureCN/data/Normal/BEVPAC_normal_coverages.list \ --genome hg38 --assay Twist_Exome

PureCN run

Rscript $PURECN/PureCN.R --out $OUT/$Sample \ --tumor /proj/sens2019581/nobackup/PureCN/data/Tumor/$Sample".recal_coverage_loess.txt.gz" \ --sampleid $Sample \ --vcf /proj/sens2019581/nobackup/PureCN/data/$Sample".vcf" \ --fun-segmentation PSCBS \ --normaldb $OUT_REF/normalDB_Twist_Exome_hg38.rds \ --intervals $OUT_REF/Twist_Exome_RefSeq_targets_hg38_intervals.txt \ --snp-blacklist $OUT_REF/hg38-blacklist.v2.bed \ --genome hg38 \ --minpurity 0.05 --max-copy-number 8 \ --rds=rda \ --max-non-clonal 0.3 \ --model betabin \ --force --post-optimize --seed 123 \ --cores 32

Log file UE-2971-1106-0.log UE-2971-1106-0.csv

B-allele frequency plot 1693280778723

lima1 commented 1 year ago

This looks pretty decent. You can just increase the --max-segments. For whole exome sequencing of tissue, GATK's segmentation is worth trying.

lima1 commented 1 year ago

--fun-segmentation GATK (with gatk in path).

WangKang-Leo commented 1 year ago

Dear author

Thanks for your prompt reply.

Here I ran the same sample using GATK4 segmentation and set --max-segments as 350. I found it outputs same purity and ploidy, but no "NOISY SEGMENTATION" comment and fewer LOH event (segments). Does it look better than before? I can use them for production?

Thanks/Kang UE-2971-1106-0.log 1694438962680

lima1 commented 1 year ago

Looks good!