Closed zhoudreames closed 1 year ago
@tprodanov I am so sorry to disturb you, would you help me? Thanks~
Hi @zhoudreames ,
Yes, it is more or less correct. I have two notes:
res.paralog.bed.gz
file anyway.Additionally, by default Parascopy
does not call paralog-specific copy number for such high-copy duplications. You can change that with --pscn-bound
parameter, but probably it would not work (as there are exponential number of possible paralog-specific copy number tuples, it will take too much time).
However, Parascopy
will still output aggregate copy number across all repeat copies, which can also be helpful.
In any case, please try to run this command and write to me, if you have further questions!
Hi @tprodanov I am so sorry to see your reply too late, thank you for your detailed response; I followed your second method, and I got a good result, but there are some things I don't quite understand: First, my code is shown below:
#1. I selected one of the CDK20 genes with regions.bed
chr14 18309334 18335583
#2. rerun the paracopy
parascopy cn -i self.bam -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1 --psc-bound 20
After the application completes, I got three main filed, including res.matrix.bed.gz, res.paralog.bed.gz, and res.samples.bed.gz.
The res.samples.bed.gz reported 10 copies of the CDK20 gene, this is consistent with the actual copy number of this gene in the genome, due to which I used self-WGS data to test it. But the res.samples.bed.gz contain many columns with same 10 copies, which one should I choose as the final result?
Then, I found the res.paralog.bed.gz is empty, Is this normal?
Hi @zhoudreames,
In the res.samples.bed
file, the first three columns represent the main copy, and all the other copies are written in homologous_regions
column. So, overall, in your case, there are in total 5 repeat copies, and in total they have aggregate copy number 10.
Maybe, your ten copies are not so similar to each other, so not all of them were actually found the input region you provided. You can try to take all your 10 repeat copies, and run parascopy examine
. It will show the number of homologous regions for each region in the reference.
As to why there are multiple lines in the output: these are different subregions, with some gaps in between. Usually, it means that there are some small duplications in between, or, opposite, short non-duplicated region.
As you can see on the first image, paralog-specific CN (psCN) is unknown (?,?...), therefore res.paralog.bed.gz
is empty.
You can try to increase the second parameter in --pscn-bound
as well, for example set it to --pscn-bound 20 10000
.
But from my experience, it is very difficult to identify paralog-specific copy number of duplications with 5 and more copies.
Finally, you can update Parascopy
, it should not change anything drastically, but there were some bug fixes recently.
Thanks~
For our no human genome, the CDK20 genes contain 10 copies, I want to estimate the copy number of CDK20 at the population-level.
Following your pipeline, I run the following code to calculate it.
there are two customer input files, including the windows.bed and regions.bed. the former windows.bed were built from #4 , the regions.bed was shown in below, contained the location of ten CDK20 genes copies:
I don't know if the region.bed is correct? if not, how to make the .bed file for the pipeline?