how to use custom genome as reference

tprodanov / parascopy

Copy number estimation and variant calling for duplicated genes using WGS.

MIT License

22 stars 3 forks source link

how to use custom genome as reference #1

Closed zou-yawen closed 1 year ago

zou-yawen commented 1 year ago

Hi, I want to use custom genome as reference instead of hg38;but I don`t know how to do this.Could you please help me.

tprodanov commented 1 year ago

Hi @zou-yawen

To use a custom genome, please see this readme section. Basically, first you need to create a database of duplications (parascopy pretable and parascopy table). For that the reference genome needs to be indexed with samtools faidx and bwa index. Then, you can run parascopy cn as for the regular human genome. Note, if you plan to use Parascopy first a small number of samples (for example one), aggregate copy number will be reported, but paralog-specific copy number will be not. You can modify this behavior with --min-samples 1, but be careful, the results will not be as reliable as if you use many samples.

zou-yawen commented 1 year ago

Thanks a lot.When I use samtools(1.15.1) faidx and bwa(0.7.17) index to create genome index ,then use "parascopy pretable -f genome.fa -o pretable.bed.gz".Something wrong with it.

tprodanov commented 1 year ago

The errors on the first screenshot are not important, it is more of a warning.

What would happen if you run samtools faidx genome.fa ChrMT > chr_mt.fa? If it works, can you send me the output? It is possible that there is something wrong with the input fasta file (incomplete MT chromosome).

zou-yawen commented 1 year ago

"samtools faidx genome.fa ChrMT > chr_mt.fa" does`t work,error is below: below is rigth:

tprodanov commented 1 year ago

So it is a problem with the reference fasta file (incomplete MT sequence), not with Parascopy. You can try to remove MT from the reference, or just create/download the reference anew. You can also check the .fai file, maybe there is a problem there. You can find format description here.

zou-yawen commented 1 year ago

Thanks a lot.I download the reference from website again,it works.

zou-yawen commented 1 year ago

Hi,do you have any advice about --bed-regions , what is the size of every window.I do not want to use hg19/hg38.

tprodanov commented 1 year ago

First of all, you can check out the regions here to get an idea. For both genomes we used a set of ~90k windows, with window length 100 bp, and all do not overlap the homology table.

The problem is that many genomic regions have some biases, so it is not so trivial to select good set of background windows. You can try to select many exons, extend them to the left and right, and filter regions so that they do not overlap the homology table:

bedtools slop -i input.bed -g genome.fa.fai -b 5000 | \
    bedtools merge -i - -d 1000 > tmp/ext_regions.bed
bedtools makewindows -b tmp/ext_regions.bed -w 100 > tmp/windows.bed
zgrep -v '#' table.bed.gz | cut -f1-3 | \
    bedtools slop -b 500 -i - -g genome.fa.fai | \
    bedtools merge -i - -d 500 > tmp/table.bed
bedtools intersect -a tmp/windows.bed -b tmp/table.bed -v | \
    awk '$3 - $2 == 100' > windows.bed

I did not run this code right now, but it should be more or less correct.