twolinin / longphase

GNU General Public License v3.0
99 stars 9 forks source link

How to calculate block N50 when co-phasing SNP and SV? #54

Closed Jerry-is-a-mouse closed 6 months ago

Jerry-is-a-mouse commented 6 months ago

@twolinin If phasing only with snps, it is OK to use whatshap stats to get block N50. And you mentioned that associated with Nanopore datas and co-phasing SV, it may produce a larger block N50. But what tool is suitable to calculate block N50 when SV phased together? Whatshap doesn't have the function of phasing SV as introduced in it document. But you used it to do co-phasing in your paper, although I tested it and it can phase a number of SV.

twolinin commented 6 months ago

Hi @Jerry-is-a-mouse, If you want to calculate the block N50, I can provide you with two simple shell scripts for testing.

cat snp.vcf sv.vcf > merge.vcf
bash snp_block.sh merge
bash block.sh merge.block

snp_block.sh

grep -e "0|1" -e "1|0" $1.vcf |
grep -v "#" |
awk '{
    split($9, tags, ":")
    split($10, values, ":")
    for (i=1; i<=length(tags); i++) {
        if (tags[i] == "GT") {
            gt_value = values[i]
        }
        if (tags[i] == "PS") {
            ps_value = values[i]
        }
    }
    print  $1 "\t" $2 "\t" gt_value "\t" ps_value
}' |
sort -k1,1 -k4,4n -k2,2n |
awk '{
    if( $4 != upPS ){
        if(NR>1 && count > 1){
            print upChr "\t" strPos "\t" upPos - strPos
        }

        strPos = $2
        count=0
    }
    upPS = $4
    upPos = $2
    upChr = $1
    count++
}
END{
    print upChr "\t" strPos "\t" upPos - strPos
}' | sort -k3rn > $1.block

block.sh

awk '
BEGIN{
    block=0
}
ARGIND==1{
    #sum of block length
    sum+=$3
    block++
}
ARGIND==2{
    count+=$3
    if(count>=sum/2){print "#Block\t" block "\tN50:\t" $3  "\tSum\t" sum ;exit;}
}' $1 $1
Jerry-is-a-mouse commented 6 months ago

@twolinin Thank you very much for your in-time reply. I have another question to ask for your help, that is, is it right to put snp.phased.vcf and sv.phased.vcf in only 1 vcf format file, then calculate the block N50 using whatshap stats? I just found that whatshap phase also phase sv if pass a sv.vcf file to this command, and the paper of longphase also agrees this. And I tried to turn 2 vcf files into only 1, then pass it to whatshap stats and that outputs a block N50 too.

Jerry-is-a-mouse commented 6 months ago

And that as you mentioned in https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac058/6519151, you evaluated the co-phasing accuracy of LongPhase and whatshap, I am not sure how to perform it, the document of Whatshap do not say that, so I want your help again .

twolinin commented 6 months ago

Hi @Jerry-is-a-mouse, If you want to merge SNP.vcf and SV.vcf and calculate the result using whatshap stats, you can sort the variants from both VCF files by chromosome and position to generate a new VCF. This new VCF can then be used with whatshap stats or whatshap phase.

In order to utilize whatshap compare for calculating SV switch error rates in the paper, we intersected the SV benchmark VCF with the SV VCF used for phasing. We then standardized the coordinates, REF, ALT, and other fields of the same type of SV that had intersections.

Jerry-is-a-mouse commented 6 months ago

@twolinin I am so sorry that I did not express my questions properly. I agree whatshap stats only accepts 1 vcf file as an input, so I have to merge snp.vcf and sv.vcf (both phased) to 1 vcf file containing both snp and sv. So my question is that how to perform co-phasing using whatshap phase command. Due to whatshap phase only accepts 1 unphased vcf file as an input, I think co-phasing can be performed in these two ways as follows: 1) merge unphased snp and sv vcf files first, then use it as input for whatshap phase command, the output phased vcf file can be used as input fo whatshap stats and whatshap compare; 2) phase unphased snp vcf file and sv file separately, merge the phased snp and sv files to a phased vcf file containing both snp and sv then, and use it as input for whatshap stats and whatshap compare. I am not sure which one is a right way or a more right way. Sorry again.

twolinin commented 6 months ago

@Jerry-is-a-mouse,

  1. If we aim to obtain a more comprehensive phasing result, it is necessary to co-phase two types of variants. The difference lies in the method used: with whatshap phase, the two VCF files need to be merged into one, whereas longphase provides the --sv-file option for users to input the SV file. As for whatshap stats and whatshap compare, they merely provide comparative data results, and there are various approaches to achieve this.
  2. Phasing SNP and SV separately will result in two completely non-overlapping phase sets. Merging the two non-overlapping phase sets VCF files will not yield a more complete phase set or block N50.