senzhaocode / SV_standard

Convert raw SV outputs of multiple callers (using either RNA-seq or RNA-seq data) to FuSViz input format
MIT License
0 stars 0 forks source link

Wrong script/svaba_svtype.R #1

Open weizhu365 opened 1 week ago

weizhu365 commented 1 week ago

The R script is to fix the svaba VCF format, but there are several obvious bugs in the R script.

First, cols needs to be read first:

cols <- NA
if ( grepl("\\.gz$", args[1]) ) {
    svaba_uniq <- read.table(pipe(paste0('zgrep -v "##" ', args[1], ' | sed s/#//')), header=T, col.names = cols, stringsAsFactors = FALSE)
    cols <- colnames(read.table(pipe(paste0('zgrep -v "##" ', args[1], ' | sed s/#//')), header=T))
} else if ( grepl("\\.vcf$", args[1]) ) {

    svaba_uniq <- read.table(pipe(paste0('grep -v "##" ', args[1], ' | sed s/#//')), header=T, col.names = cols, stringsAsFactors = FALSE)
        cols <- colnames(read.table(pipe(paste0('grep -v "##" ', args[1], ' | sed s/#//')), header=T))

}

And The following two lines needs to be comment out as it is not right to assume that there are two samples and all with .extension (one sample with the proper sample id is more common):

cols[10] <- unlist(strsplit(cols[10], '\\.'))[10]
cols[11] <- unlist(strsplit(cols[11], '\\.'))[10]

Lastly, the output is a TSV file not VCF (not as described in README):

write.table(svaba_uniq, file=args[2], quote=F, row.names=F, col.names=F, sep='\t')

I guess the right version has been developed but has not been updated in this repo. It should be an easy fix.

By the way, the similar issue for the Rscript rename-gridss.R, which should have been updated to accept command-line inputs (rather than to be hard-coded as it is now).

Nevertheless, your scripts provide a good solution to merge SVs from GRIDSS and svaba.

Thanks, Wei Zhu

senzhaocode commented 1 week ago

Dear Wei Zhu,

Thank you for the comments. Very sorry that the script _svabasvtype.R did not get an update in the repo. Now I believe these bugs related to it are fixed.

In terms of rename-gridss.R script, I have done an update to accept a few options under CLI, as you suggested. As its functionality depends on several R packages (e.g., VariantAnnotation, StructuralVariantAnnotation) installed, I would like to have more example testing before a release. I think the update one will be available on github in 1-2 days. I will keep this ticket open until the issue of rename-gridss.R is fixed as well.

Best

Sen

weizhu365 commented 1 week ago

Many thanks for your quick update.

By the way, I have a question about INS. It seems INS is treated as DUP now. Is it possible to specify INS from the other SV types in svaba?

Best,

Wei

senzhaocode commented 1 week ago

Hello Wei,

Thank you for raising this question. This script is used the strand direction to infer SV category for svaba calls. If strand direction is -+, it can be either duplication and insertion, as you pointed out. So far, it is difficult to distinguish INS from DUP based on the strand direction for svaba calls. In terms of occurrence probability, DUP (in particular tandem duplication) generally shows more prevalent than long INS in genomic rearrangement events. Here, INS is treated as DUP arbitrarily - I agree this is not a good solution, but I have no other better solutions.

Virtually, SV callers using local assembly approaches (e.g. svaba and gridss) abandon a traditional concept for the classification of SV types (i.e. duplication, deletion, insertion, reversion and translocation), and they think these categories are imprecise and too simple to interpret complex SV. There could be some problems if users would like to assign a category type to SV call of gridss and svaba precisely (like an issue to distinguishing DUP from INS).

Best

Sen

weizhu365 commented 1 week ago

Sen;

I understand the problem in both svaba and GRIDSS. I am working on the right SVtype is to use SURVIVOR to merge them. Right now, many programs (like SURVIVOR or other SV merge/genotype tool) only recognize manta VCF format (not BND VCF format). (Please let me know if you know bfx tools to do better jobs than SURVIVOR, svimmer, etc).

Here is an interesting PERL script for your reference:

https://raw.githubusercontent.com/stat-lab/EvalSVcallers/master/scripts/vcf_convert/convert_GRIDSS_vcf.pl

where, the distance between BNDs is checked INS is defined if the distance is < 30 bp.

Even though the script is for GRIDSS, I wonder whether you may borrow the same idea to separate INS from DUP.

Nevertheless, it should be reasonable to treat short DUP as INS. :-)

Thanks,

Wei