The results of fastCN-smk

zhoudreames commented 1 year ago

Following your pipeline, I want to estimate the copy number of CKD20 genes in chr14 for no-human species, and show you some steps in below

#1. First, I extra subset of bam and ref.fa files for chr14 and chrX , and exit config.yamls and manifest.tbl files
>config.yamls
reference: "chr14_chrX"
fasta: chr14_chrX.fa
sd: chr14_chrX.SD_lowidSD.bed
nchunks: 100
manifest: manifest.tbl
#
RepeatMasker: chr14_chrX.repeatmasker.out.bed
trf:   chr14_chrX.trf.bed
gaps: chr14_chrX.gap.bed
windowmasker: chr14_chrX.dust.bed.gz

>manifest.tbl
sample  reads
chr14_chrX  chr14_chrX.bam
#2. run the pipeline with snakemake
 snakemake --configfile config.yaml --cores 100

After the task ran successfully, I got many result files and I try to guess the final result of the pipeline, which may be results/test-reference/tracks/bed9/wssd/chr14_chrX.bed.gz file. Then, I got genome location information for 10 copies of the CDK20 gene ：

>CKD20.gene.bed
chr14   10440813        10467260
chr14   12592476        12593495
chr14   12597478        12598864
chr14   12984135        12992502
chr14   14839469        14847923
chr14   15451029        15477492
chr14   18309334        18335583
chr14   20048637        20061521
chr14   23532803        23552098
chr14   25822165        25830699

run the bedtools intersect -a chr14_chrX.bed.gz -b CKD20.gene.bed -wa >CDK20.CN.bed:

chr14   15455764        15457244        15      0       +       0       0       255,0,0 14.811280738083717
chr14   15457244        15458497        21      0       +       0       0       0,255,255       20.570753164315505
chr14   15458497        15460287        14      0       +       0       0       255,0,0 14.270734788228092
chr14   15460287        15461833        19      0       +       0       0       255,0,0 18.76034870060004
chr14   15461833        15463917        17      0       +       0       0       255,0,0 16.852522246664194
chr14   15463917        15469972        127     0       +       0       0       255,192,203     126.87696493514846
chr14   15469972        15470972        138     0       +       0       0       255,192,203     137.98576168071256
chr14   15470972        15473085        124     0       +       0       0       255,192,203     124.3283957386648
chr14   15473085        15474837        87      0       +       0       0       139,0,139       87.3425533777707
chr14   15474837        15476108        70      0       +       0       0       75,0,130        69.61878957813792
chr14   15476108        15477108        156     0       +       0       0       255,192,203     155.50658903142093
chr14   15477108        15478108        153     0       +       0       0       255,192,203     153.20165165648092
chr14   18308097        18309780        22      0       +       0       0       0,255,255       22.00398398766052
chr14   18309780        18311216        17      0       +       0       0       255,0,0 16.949622607427447
chr14   18311216        18312643        21      0       +       0       0       0,255,255       20.837628369564428
chr14   18312643        18313643        16      0       +       0       0       255,0,0 15.880915285175243
chr14   18313643        18314643        16      0       +       0       0       255,0,0 15.72303518577596
chr14   18314643        18316208        17      0       +       0       0       255,0,0 17.292540100417114
chr14   18316208        18317977        18      0       +       0       0       255,0,0 18.47539706454207
chr14   18317977        18319527        19      0       +       0       0       255,0,0 19.056455466714944
.....

The copy number of the CDK20 gene varies greatly(9~166 copies), but most are around 20 copies, which is consistent with the actual copy of the genome.
I want to know how to obtain the final copy number this CDK20 gene , such as using the average or median of the intersection results. in addition, Is there any problem with the results file I use or the method for evaluating gene copy numbers？

zhoudreames commented 1 year ago

@mrvollger @wharvey31 I am so sorry to disturb you， I posted too many questions. Would it be convenient for you to answer？ Thank you~

mrvollger commented 1 year ago

Copy number can often vary over the gene model.

I would visualize the results against the gene model and try to select the subregion that you believe to be representative for that gene. Here is a link to a representative example for a human gene where I highlight the region I would select: http://genome.ucsc.edu/s/mrvollger/wssd-example

If you cannot do that for some reason I would default to the median.

mrvollger / fastCN-smk

The results of fastCN-smk #2