how to estimate the copy number pf specific CDK20 gene in no human genome

zhoudreames commented 1 year ago

For our no human genome, the CDK20 genes contain 10 copies， I want to estimate the copy number of CDK20 at the population-level. Following your pipeline, I run the following code to calculate it.

1.building Homology table
parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz

2.Estimating copy number
# Calculate background read depth.
parascopy depth -I input.list -b windows.bed  -f genome.fa -o depth
# Estimate agCN and psCN for multiple samples.
parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1

there are two customer input files, including the windows.bed and regions.bed. the former windows.bed were built from #4 , the regions.bed was shown in below, contained the location of ten CDK20 genes copies:

chr14   10440813        10467260        CDK20
chr14   12592476        12593495        CDK20
chr14   12597478        12598864        CDK20
chr14   12984135        12992502        CDK20
chr14   14839469        14847923        CDK20
chr14   15451029        15477492        CDK20
chr14   18309334        18335583        CDK20
chr14   20048637        20061521        CDK20
chr14   23532803        23552098        CDK20
chr14   25822165        25830699        CDK20

I don't know if the region.bed is correct? if not, how to make the .bed file for the pipeline?

zhoudreames commented 1 year ago

@tprodanov I am so sorry to disturb you, would you help me? Thanks~

tprodanov commented 1 year ago

Hi @zhoudreames ,

Yes, it is more or less correct. I have two notes:

First, it is better to name the regions differently (fourth column), for example CDK20a, CDK20b, etc. Alternatively, you can leave out the fourth column at all, and just provide the bed file with three columns.
Second, If all these regions are copies of a single duplication, you can provide just one of them, it will output copy number of all copies in the output res.paralog.bed.gz file anyway.

Additionally, by default Parascopy does not call paralog-specific copy number for such high-copy duplications. You can change that with --pscn-bound parameter, but probably it would not work (as there are exponential number of possible paralog-specific copy number tuples, it will take too much time). However, Parascopy will still output aggregate copy number across all repeat copies, which can also be helpful.

In any case, please try to run this command and write to me, if you have further questions!

zhoudreames commented 1 year ago

Hi @tprodanov I am so sorry to see your reply too late, thank you for your detailed response; I followed your second method, and I got a good result, but there are some things I don't quite understand: First, my code is shown below:

#1. I selected one of the CDK20 genes with regions.bed
chr14   18309334        18335583

#2. rerun the paracopy
parascopy cn -i self.bam -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1 --psc-bound 20

After the application completes， I got three main filed, including res.matrix.bed.gz, res.paralog.bed.gz, and res.samples.bed.gz. The res.samples.bed.gz reported 10 copies of the CDK20 gene, this is consistent with the actual copy number of this gene in the genome, due to which I used self-WGS data to test it. But the res.samples.bed.gz contain many columns with same 10 copies, which one should I choose as the final result？ Then, I found the res.paralog.bed.gz is empty, Is this normal？

tprodanov commented 1 year ago

Hi @zhoudreames,

In the res.samples.bed file, the first three columns represent the main copy, and all the other copies are written in homologous_regions column. So, overall, in your case, there are in total 5 repeat copies, and in total they have aggregate copy number 10.

Maybe, your ten copies are not so similar to each other, so not all of them were actually found the input region you provided. You can try to take all your 10 repeat copies, and run parascopy examine. It will show the number of homologous regions for each region in the reference.

As to why there are multiple lines in the output: these are different subregions, with some gaps in between. Usually, it means that there are some small duplications in between, or, opposite, short non-duplicated region.

As you can see on the first image, paralog-specific CN (psCN) is unknown (?,?...), therefore res.paralog.bed.gz is empty. You can try to increase the second parameter in --pscn-bound as well, for example set it to --pscn-bound 20 10000. But from my experience, it is very difficult to identify paralog-specific copy number of duplications with 5 and more copies.

Finally, you can update Parascopy, it should not change anything drastically, but there were some bug fixes recently.

zhoudreames commented 1 year ago

Thanks~

tprodanov / parascopy

how to estimate the copy number pf specific CDK20 gene in no human genome #5