knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

issue reading vcf generated by lofreq #80

Closed k-florek closed 7 years ago

k-florek commented 7 years ago

I'm trying to use vcfR to read in a vcf file generated by lofreq and I'm seem to be having issues when trying to use the resulting vcfR class. I think it might be related to missing gt region in the vcf from lofreq, but I'm not sure.

> vcf <- read.vcfR("file.vcf", verbose = TRUE)
Meta line 15 read in.
All meta lines processed.
Character matrix gt created.
Character matrix gt rows: 182
Character matrix gt cols: 8
skip: 0
nrows: 182
row_num: 0

Processed variant: 182
All variants processed

The output of the class is:

> vcf
***** Object of Class vcfR *****
0 samples
1 CHROMs
0 variants
Object size: 0 Mb
NaN percent missing data
*****        *****         *****

and if I try to look at the header:

> head(vcf)
[1] "***** Object of class 'vcfR' *****"
[1] "***** Meta section *****"
[1] "##fileformat=VCFv4.0"
[1] "##fileDate=20171010"
[1] "##source=lofreq call -f ../senterica_thompson_atcc8391.fna -a 0.05 - [Truncated]"
[1] "##reference=../senterica_thompson_atcc8391.fna"
[1] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Raw Depth\">"
[1] "##INFO=<ID=AF,Number=1,Type=Float,Description=\"Allele Frequency\">"
[1] "First 6 rows."
[1] 
[1] "***** Fixed section *****"
     CHROM           POS     ID REF ALT QUAL   FILTER
[1,] "NZ_CP011396.1" "17179" NA "G" "A" "1109" "PASS"
[2,] "NZ_CP011396.1" "17711" NA "C" "T" "1411" "PASS"
[3,] "NZ_CP011396.1" "81296" NA "C" "T" "1159" "PASS"
[4,] "NZ_CP011396.1" "85988" NA "C" "A" "1530" "PASS"
[5,] "NZ_CP011396.1" "98517" NA "A" "G" "956"  "PASS"
[6,] "NZ_CP011396.1" "99261" NA "C" "T" "375"  "PASS"
[1] 
[1] "***** Genotype section *****"
<0 x 0 matrix>
[1] 
[1] "Unique GT formats:"
Error in print(unique(as.character(x@gt[, 1]))) : 
  error in evaluating the argument 'x' in selecting a method for function 'print': Error in x@gt[, 1] : subscript out of bounds

similarly when I try to run proc.chrmR I get another error:

> vcf <- proc.chromR(vcf, verbose = TRUE)
Nucleotide regions complete.
  elapsed time:  1.705
N regions complete.
  elapsed time:  1.537
Error in nrow(x@vcf@gt[x@var.info$mask, , drop = FALSE]) : 
  error in evaluating the argument 'x' in selecting a method for function 'nrow': Error in x@vcf@gt[x@var.info$mask, , drop = FALSE] : 
  (subscript) logical subscript too long

here is my sessionInfo:


Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfR_1.5.0 ape_4.1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13      lattice_0.20-33   viridisLite_0.2.0 permute_0.9-4     memuse_3.0-1      MASS_7.3-45      
 [7] grid_3.2.3        nlme_3.1-124      magrittr_1.5      vegan_2.4-4       Matrix_1.2-3      pinfsc50_1.1.0   
[13] tools_3.2.3       parallel_3.2.3    cluster_2.0.3     mgcv_1.8-11```
knausb commented 7 years ago

Hello,

I personally don't work with these VCF files with not gt section. So I don't have very much experience running them through vcfR. But if you're willing to help me...

This is not actually an issue with reading the VCF data into R. I like to think I have that under control and have unit tests to help with that. You demonstrate above that you successfully read your data into R. The issues you report are downstream processing with the functions head() and proc_chrom(). Your interpretation of the issue appears correct in that the functions are trying to access portions of the gt regions that do not exist in your data.

I have never used lofreq, so could you please validate something for me? You report the following works.

> vcf <- read.vcfR("file.vcf", verbose = TRUE)

Let's validate the state of the gt slot. you should get the following.

vcf@gt
<0 x 0 matrix>

The initial state of the gt slot should be a matrix with zero rows and zero columns. If this is what you are seeing than I can modify functions to check for this prior to trying to manipulate that slot. And that should give you functionality. Can you try that and let me know how it went? Thanks!

k-florek commented 7 years ago

The functionality of vcfR is great so I'll definitely try to help the best I can!

The state of the gt slot is a 0x0 matrix.

vcf@gt
<0 x 0 matrix>
knausb commented 7 years ago

Thanks for validating that. This suggests that I should be able to reproduce the behaviour you reported as follows.

data("vcfR_example")
vcf@gt <- matrix(nrow=0, ncol=0)

I think I've addressed the head() issue in my development version. But I can't reproduce the proc_chromR() issue.

myChrom <- proc.chromR(vcf)
Error: class(x) == "chromR" is not TRUE

This is because one of the first things proc.chromR() does is to check the class you gave it and bail out if it isn't chromR. So I think you've left out a step. My guess would be as follows.

data("vcfR_example")
vcf@gt <- matrix(nrow=0, ncol=0)
class(vcf)
myChrom <- create.chromR(vcf, verbose = FALSE)
myChrom <- proc.chromR(myChrom)
Error in x@vcf@gt[x@var.info$mask, , drop = FALSE] : 
  (subscript) logical subscript too long
In addition: Warning messages:
1: In proc.chromR(myChrom) : seq slot is NULL.
2: In proc.chromR(myChrom) : annotation slot has no rows.
3: In proc.chromR(myChrom) :
  seq slot is NULL, chromosome representation not made (seq2rects).
4: In proc.chromR(myChrom) :
  seq slot is NULL, chromosome representation not made (seq2rects, chars=n).

This is an error that appears associated with the lack of a gt slot. And I'll work on addressing it. But it does not appear to be the same as what you have reported. Could you please review the steps that led you to your error? Can you provide me with an example that will help me reproduce this behaviour on my end? You can either using the data I presented above or send me your own.

Thanks!

k-florek commented 7 years ago

In my description of the issue I left out the step where I created the chromR object since it didn't return any errors.

The reason our error messages are different is because in the create.chromR() function I am specifying the vcf file, reference sequence, and annotations. Here is my full script. I changed the name of the variables when I submitted my initial ticket to reduce confusion.

library(ape)
library(vcfR)

mag_pure <- read.vcfR("alignments/PNUSAS024723.vcf",verbose = TRUE)
hand <-  read.vcfR("alignments/PNUSAS024723H.vcf",verbose = TRUE)
reference <- read.dna("senterica_thompson_atcc8391.fna",format = "fasta")
annotations <- read.table("senterica_thompson_atcc8391.gff",sep="\t",quote="")

chrom_mag_pure <- create.chromR(name="Magna_Pure", vcf=mag_pure,seq=reference,ann=annotations)
chrom_hand <- create.chromR(name="Hand_purification",vcf=hand,seq=reference,ann=annotations)

#look at variants
plot(chrom_mag_pure)
plot(chrom_hand)

#process chromR object
chrom_mag_pure <- proc.chromR(chrom_mag_pure, verbose = TRUE)
chrom_hand <- proc.chromR(chrom_hand,verbose = TRUE)

chromoqc(chrom_mag_pure, dp.alpha=20)
chromoqc(chrom_hand, dp.alpha=20)

Hopefully this will help! I've attached one of the vcf files in case it's helpful. PNUSAS024723.vcf.zip

knausb commented 7 years ago

I think that's just what I needed! I now have the following.

library(vcfR)
data("vcfR_example")
vcf@gt <- matrix(nrow=0, ncol=0)
myChrom <- create.chromR(vcf=vcf, seq = dna, ann = gff, verbose = FALSE)
myChrom <- proc.chromR(myChrom)

And it now throws no errors. In hindsight, the error you reported and the one I generated appear to be the same. The difference in output was due to the lack of seq and ann.

This now appears to work and plots a chromoqc() as well.

These changes are now on the master branch at GitHub. If you'd like to try it you can use the following.

devtools::install_github(repo="knausb/vcfR")

To be honest, when I started this project I did not realize you could have a valid VCF that had no gt. But this is supported by the VCF specification, so I feel obligated to support it. If you notice anything else please let me know.

Thanks!

k-florek commented 7 years ago

Works beautifully! Thanks a bunch, I'll let you know if anything else breaks.