twolinin / longphase

Very few SV can be phased #29

Closed tgong1 closed 10 months ago

tgong1 commented 1 year ago

Hi,

I'm using longphase (version 1.4) for SNP and SV co-phasing, but very few SVs end up phased. For example, co-phasing the HG002 SNP and SV callsets phased only 7 of 6,938 het SVs. The command I used:

/public/home/fan_lab/gongtit/longphase_linux-x64 phase \
    -s ${SNV} \
    --sv-file ${SV} \
    -b $BAM \
    -r $REF \
    -t 8 \
    -o $OUT \
    --ont

The input VCFs:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh37/HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz

(Only PASS SVs were used as input for longphase.)

Any suggestions or ideas are appreciated!

Thank you for your time and help, Tingting

tgong1 commented 1 year ago

Hi, I have tried the latest version, 1.5.1, on the same input VCFs. longphase now phases 3,641 of the 6,938 het SVs. However, the proportion of phased SVs is still lower than what you report in Table S7 vs. Table S5. About 80% of SNPs were phased with longphase 1.4 and about 84% with longphase 1.5.1.

What could explain the change in the number of phased SVs? What else can I try to increase it?

Thank you for your time and help, Tingting
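For anyone wanting to reproduce counts like "3,641 of 6,938 het SVs", here is a minimal Python sketch of one way to tally them. It assumes the phased output follows the usual VCF convention of writing phased genotypes with `|` (e.g. `0|1`) and unphased ones with `/`, and that only the first sample column matters. The function name `count_phased_het` and the example records are hypothetical, not part of longphase.

```python
def count_phased_het(vcf_lines):
    """Return (phased, total) heterozygous records for the first sample.

    Assumption: a phased genotype uses '|' as the allele separator and an
    unphased genotype uses '/', as in the VCF specification.
    """
    phased = total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column, nothing to count
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt = fields[9].split(":")[keys.index("GT")]
        sep = "|" if "|" in gt else "/"
        alleles = gt.split(sep)
        # keep only diploid, fully called, heterozygous genotypes
        if len(alleles) != 2 or "." in alleles or alleles[0] == alleles[1]:
            continue
        total += 1
        if sep == "|":
            phased += 1
    return phased, total


# Hypothetical records, for illustration only
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002",
    "1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT:PS\t0|1:1000",
    "1\t2000\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t0/1",
    "1\t3000\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t1/1",
]
print(count_phased_het(example))
```

For a real file, the lines could be read with `gzip.open(path, "rt")` for a `.vcf.gz` or plain `open(path)` otherwise, and passed to the same function.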

twolinin commented 1 year ago

Hi @tgong1,

We recommend using SNPs called from the sequencing reads you are actually phasing, for example with PEPPER, rather than using benchmark VCF files directly. The variants in a benchmark dataset may differ from those present in your specific reads. For SVs (structural variants), please use one of the SV callers: Sniffles (with --num_reads_report in Sniffles1 or --output-rnames in Sniffles2) or cuteSV (with --report_readid --genotype).

Thanks
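A quick sanity check before co-phasing is to see whether the SV VCF carries per-variant read names at all. The Python sketch below counts records whose INFO column has a non-empty read-name field. It assumes the caller options above write the names to an INFO tag called `RNAMES`; that tag name, the placeholder values treated as missing, and the function name `rnames_coverage` are assumptions to verify against your own VCF header.

```python
def rnames_coverage(vcf_lines, tag="RNAMES"):
    """Return (records_with_read_names, total_records).

    Assumption: supporting read names live in the INFO column under `tag`.
    """
    with_tag = total = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record without an INFO column
        total += 1
        # INFO is ';'-separated; entries are KEY=VALUE or bare flags
        info = dict(item.partition("=")[::2] for item in fields[7].split(";"))
        value = info.get(tag, "")
        # treat empty or placeholder values as missing (an assumption)
        if value not in ("", ".", "NULL"):
            with_tag += 1
    return with_tag, total


# Hypothetical records, for illustration only
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;RNAMES=read1,read2",
    "1\t2000\tsv2\tN\t<INS>\t.\tPASS\tPRECISE;SVTYPE=INS",
    "1\t3000\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;RNAMES=NULL",
]
print(rnames_coverage(records))
```

If few or no records carry read names, as would be expected for a benchmark SV VCF that was not called from your own alignments, re-calling SVs with the options above is the first thing to try.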

tgong1 commented 11 months ago

Thank you for the reply. The parameter --output-rnames helped! I'm wondering whether the choice of aligner, e.g. minimap2 or NGMLR, can influence the performance of longphase?

Thank you, Tingting

ythuang0522 commented 11 months ago

My impression is that NGMLR is dedicated to SVs; we have never tested it because of speed concerns. The majority of small-variant callers (e.g., DeepVariant/Clair) seem to be trained on minimap2 alignments. As SNPs are the major source of phasing material, you might have to take this into account. Theoretically, large SVs may be called better by NGMLR/Sniffles. However, it looks to me like the Sniffles2 team now uses minimap2 instead. So minimap2 may be the safe choice. We would be happy to hear your comparison results, if any.

tgong1 commented 10 months ago

Hi, thank you for the reply. From what I can see, the choice of aligner (NGMLR or minimap2) does not change the accuracy of SV phasing with longphase much, though this is not based on rigorous benchmarking. Thank you for all the help and for developing this great tool. I will close this issue.