madprime / cgivar2gvcf

Lossy conversion of Complete Genomics var file to VCF
MIT License

Long REF line for a nocall #4

abeconnelly closed this issue 8 years ago

abeconnelly commented 8 years ago

When converting my own CGI var file (from my Harvard PGP page), some lines that I would expect to be simple "nocall" lines instead carry a long reference string in the "REF" column, while still having 'NOCALL' as the filter.

For example, the following shows up:

chr1    997434  .       C       .       .       PASS    END=997442      GT      0/0
chr1    997443  .       CCTTGTCCCCGTTCCCTCCGTCCCTCTCCCCCTTCCTTCCCTCCCTCCCTCACCACCATTCCCTCCCTCCCACAT             .       NOCALL  .       GT      ./.
chr1    997518  .       C       .       .       PASS    END=997527      GT      0/0
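For comparison, here is a minimal sketch of the kind of record I would have expected instead. It is not the cgivar2gvcf implementation; `collapse_nocall` is a hypothetical helper, and it assumes a no-call region should be written like the other block records: a single reference base in REF and the span carried in `END=`.

```python
def collapse_nocall(record: str) -> str:
    """Rewrite a long-REF no-call VCF line as a single-base block record.

    Hypothetical helper for illustration only, not part of cgivar2gvcf.
    """
    # Split on any whitespace so pasted text with spaces instead of tabs still parses.
    fields = record.split()
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    # POS is 1-based and inclusive, so the last covered base is POS + len(REF) - 1.
    end = pos + len(ref) - 1
    return "\t".join(
        [chrom, str(pos), ".", ref[0], ".", ".", "NOCALL", f"END={end}", "GT", "./."]
    )


# Small synthetic example: a 4-base no-call starting at position 100.
print(collapse_nocall("chr1\t100\t.\tACGT\t.\tNOCALL\t.\tGT\t./."))
```

Applied to the 75-base REF at 997443, this would give `END=997517`, one base before the next record at 997518.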

I ran the following:

python -m cgivar2gvcf -i test0.txt -d $REFDIR

Here is a small snippet (the test0.txt file) that produces the line above. Be careful of whitespace vs. tab separation if you cut and paste:

#APPROVAL       Records of report approval are on file with Complete Genomics, Inc.
#TITLE  Whole Human Genome Sequencing
#ADDRESS        This report was prepared by Complete Genomics Inc. at 2071 Stierlin Ct., Mountain View, CA 94043
#CUSTOMER_SAMPLE_ID     hu826751
#SAMPLE_SOURCE  Other
#REPORTED_GENDER        MALE
#CALLED_GENDER  MALE
#TUMOR_STATUS   no
#LIBRARY_TYPE   Pure LFR
#LIBRARY_SOURCE Version 2
#ASSEMBLY_ID    GS000037338-ASM
#COSMIC COSMIC v65
#DBSNP_BUILD    dbSNP build 137
#GENOME_REFERENCE       NCBI build 37
#SAMPLE GS03052-DNA_B01
#GENERATED_BY   cgatools
#GENERATED_AT   2014-Jul-01 04:55:14.521195
#SOFTWARE_VERSION       2.5.0.33
#FORMAT_VERSION 2.5
#GENERATED_BY   dbsnptool
#TYPE   VAR-ANNOTATION

>locus  ploidy  allele  chromosome      begin   end     varType reference       alleleSeq       varScoreVAF     varScoreEAF     varFilter       hapLink xRef    alleleFreq      alternativeCalls
21576   2       all     chr1    997408  997432  ref     =       =
21577   2       all     chr1    997432  997433  no-call =       ?
21578   2       all     chr1    997433  997442  ref     =       =
21579   2       1       chr1    997442  997517  no-call CCTTGTCCCCGTTCCCTCCGTCCCTCTCCCCCTTCCTTCCCTCCCTCCCTCACCACCATTCCCTCCCTCCCACAT     ?                               6427
21579   2       2       chr1    997442  997453  sub     CCTTGTCCCCG     TCCCCCTTCC      21      21      AMBIGUOUS;VQLOW 6428                    TCCCCCTTCT:-10;TCCCCCTTCG:-10;TCCCCCTTTC:-10;TCCCCCTTGC:-10;TCCCCTTTCC:-11;TCCCCGTTCC:-12;TCCCTCTTCC:-13;TCCCGCTTCC:-13;TCCTCCTTCC:-13;TCCGCCTTCC:-13;TTCCCCTTCC:-16;TGCCCCTTCC:-16;TCTCCCTTCC:-16;TCGCCCTTCC:-16
21579   2       2       chr1    997453  997455  ref     TT      TT      21      21      VQLOW   6428
21579   2       2       chr1    997455  997517  no-call CCCTCCGTCCCTCTCCCCCTTCCTTCCCTCCCTCCCTCACCACCATTCCCTCCCTCCCACAT  ?                               6428
21580   2       all     chr1    997517  997527  ref     =       =
21581   2       all     chr1    997527  997598  no-call =       ?
21582   2       all     chr1    997598  997633  ref     =       =

This produces:

##fileformat=VCFv4.1
##fileDate=201617
##source=cgivar2gvcf-version-0.1.5
##description="Produced from a Complete Genomics var file using cgivar2gvcf. Not intended for clinical use."
##reference=hg19.2bit
##FILTER=<ID=NOCALL,Description="Some or all of this record had no sequence call by Complete Genomics">
##FILTER=<ID=VQLOW,Description="Some or all of this sequence call marked as low variant quality by Complete Genomics">
##FILTER=<ID=AMBIGUOUS,Description="Some or all of this sequence call marked as ambiguous by Complete Genomics">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    997409  .       C       .       .       PASS    END=997432      GT      0/0
chr1    997433  .       T       .       .       NOCALL  END=997433      GT      ./.
chr1    997434  .       C       .       .       PASS    END=997442      GT      0/0
chr1    997443  .       CCTTGTCCCCGTTCCCTCCGTCCCTCTCCCCCTTCCTTCCCTCCCTCCCTCACCACCATTCCCTCCCTCCCACAT             .       NOCALL  .       GT      ./.
chr1    997518  .       C       .       .       PASS    END=997527      GT      0/0
chr1    997528  .       C       .       .       NOCALL  END=997598      GT      ./.
chr1    997599  .       G       .       .       PASS    END=997633      GT      0/0
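As a sanity check on the coordinates, the output above is consistent with the var file using 0-based, half-open `begin`/`end` values and the VCF using 1-based, inclusive positions. The sketch below only illustrates that mapping; `var_to_vcf_span` is a hypothetical name, and the half-open assumption is inferred from the records shown here rather than taken from the tool's source.

```python
def var_to_vcf_span(begin: int, end: int) -> tuple:
    """Map a var-file interval to a VCF (POS, END) pair.

    Assumes var-file coordinates are 0-based and half-open, so only the
    start shifts by one when converting to 1-based inclusive positions.
    """
    return begin + 1, end


# Locus 21576 (ref, 997408-997432) and locus 21579 (no-call, 997442-997517).
print(var_to_vcf_span(997408, 997432))
print(var_to_vcf_span(997442, 997517))
```

Under that assumption the first locus maps to POS 997409 with END=997432, as in the output, and the problematic no-call at POS 997443 would end at 997517.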

Here is the file: test0.txt