Lack of support for (b)gzipped vcf files and analysing only part(s) of the files (region support)

dvanderleest commented 6 years ago

svtyper currently doesn't support the (b)gzipped file format. This leads to the following type of error.

Command: svtyper -B $(ls analysis/temp/*/*piped.bam | paste -sd",") -i vcf/StructuralVariants.raw.lumpy.sorted.vcf.gz -l my.bams.json > vcf/StructuralVariants.gt.vcf

Output: Traceback (most recent call last): File "/bin/svtyper", line 11, in <module> load_entry_point('svtyper==0.6.0', 'console_scripts', 'svtyper')() File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 572, in cli sys.exit(main()) File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 565, in main args.max_reads) File "/usr/lib/python2.7/site-packages/svtyper/classic.py", line 212, in sv_genotype var = Variant(v, vcf) File "/usr/lib/python2.7/site-packages/svtyper/parsers.py", line 253, in __init__ self.pos = int(var_list[1]) ValueError: invalid literal for int() with base 10: 'b\xee\xbd\xd1\xfb\xf0\x87\r\xa4=)\xa6\xa2\xda\xe8-\x81\x96\xe0\x8f\x7f|7\xbc\xe9\xeb\xe6}\x9f\xc1;\xce.\xf2'

The value of var_list[1] is a string of binary gibberish, which of course cannot be converted to an integer. This issue should be easy to solve by incorporating for instance bgzip -d -c or gunzip ( which is the same as gzip -d -c) at the location where the lines of the file are read. In the past hour I didn't find the correct site in the code yet.

dvanderleest commented 6 years ago

I found the main function in /usr/lib/python2.7/site-packages/svtyper/classic.py:

def main():
    # parse the command line args
    args = get_args()

    set_up_logging(args.verbose)

    if args.split_bam is not None:
        sys.stderr.write('Warning: --split_bam (-S) is deprecated. Ignoring %s.\n' % args.split_bam)

    # call primary function
    sv_genotype(args.bam,
        args.input_vcf,        <-- The path to input_vcf is passed to sv_genotype
        args.output_vcf,
        args.min_aligned,
        args.split_weight,
        args.disc_weight,
        args.num_samp,
        args.lib_info_path,
        args.debug,
        args.alignment_outpath,
        args.ref_fasta,
        args.sum_quals,
        args.max_reads)

sv_genotype both reads the input files and genotypes the SVs. Therefore in sv_genotype near line 188 an addition should be made for the parsing of gzipped lines. I couldn't find an open() statement, so I am not sure whether it can simply be added here or not.

dvanderleest commented 6 years ago

The file is read with parser.add_argument('-i', '--input_vcf', metavar='FILE', type=argparse.FileType('r'), default=None, help='VCF input (default: stdin)') at line 25. However, using argparse.FileType is similar to using open() which is unable to read gzipped files.

Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

Perhaps setting the type to string and opening and closing the file with gzip.open() and/or open() enables testing for compressed data.

----Edit---- Come to think of it. It is probably best to incorporate a vcf parser for this.

ernfrid commented 6 years ago

Yes, we've done something similar in svtools. See https://github.com/hall-lab/svtools/blob/master/svtools/utils.py for the class we put together to handle this. We could apply the same strategy here with a bit of refactoring.

ernfrid commented 6 years ago

I'll leave this open until the feature is added.

hall-lab / svtyper

Lack of support for (b)gzipped vcf files and analysing only part(s) of the files (region support) #91