knausb / vcfR

Tools to work with variant call format files
249 stars 54 forks source link

Error when try to read a vcf.gz file into vcfR? #100

Closed Tman3 closed 6 years ago

Tman3 commented 6 years ago

I have a "abc.vcf.gz" file resulting from bcftools mpileup. Have used zless to view the file content, and bcftools stat to analyze the vcf.gz file.

I would like to try vcfR. However, I am unable to find an example on how to read the vcf.gz, reference.fa, and reference.gff files into vcfR as dataframe. Most of the current examples involved reading in the data through the package , "pinfsc50".

I have tried the following command in Rstudio.... abc_vcf = read.vcfR('abc.vcf.gz', limit = 1e+07, nrows = -1, skip = 0, cols = NULL, convertNA = TRUE, checkFile = TRUE, verbose = TRUE) The error message: Error: unexpected symbol in "abc"

I am a newbie. The abc.vcf.gz, reference.fa, and reference.gff are in the same working directory.

Would like to get your inputs on how to read in the 3 files. or how to find out what to look for with the above error message.

Or how to form a data package similar to "pinfsc50 for input into vcfR.

Many thanks! sessionInfo() is below... sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] ggplot2_2.2.1 reshape2_1.4.3 vcfR_1.7.0

loaded via a namespace (and not attached): [1] Rcpp_0.12.15 cluster_2.0.6 magrittr_1.5 MASS_7.3-47 munsell_0.4.3 colorspace_1.3-2 [7] viridisLite_0.3.0 ape_5.0 lattice_0.20-35 rlang_0.2.0 stringr_1.3.0 plyr_1.8.4
[13] tools_3.4.3 parallel_3.4.3 grid_3.4.3 nlme_3.1-131 gtable_0.2.0 mgcv_1.8-22
[19] vegan_2.4-6 lazyeval_0.2.1 tibble_1.4.2 permute_0.9-4 Matrix_1.2-12 rsconnect_0.8.8
[25] pinfsc50_1.1.0 stringi_1.1.6 pillar_1.2.1 compiler_3.4.3 scales_0.5.0

knausb commented 6 years ago

Hi @Tman3,

This sounds like there is an issue with your VCF file. However, without a minimal reproducible example that would help me reproduce your error on my machine I'm very limited in what I can do. Can you post your VCF file to the issue, or e-mail me directly. If your file is large can you post a portion of it that reproduces your error?

The pinfsc50 dataset is just a convenient way to distribute data. But you should be able to use your data very similarly. There is an example included with vcfR that is even smaller, which is nice because you can see it on your screen. You should be able to create an example VCF file in your working directory as follows.

library(vcfR)
data("vcfR_test")
vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
write.vcf(x=vcfR_test, file = 'myData.vcf.gz')

You can take a peak at the first 10 lines as follows.

scan("myData.vcf.gz", what='character', nmax=10, sep = '\n')

This should print the first 10 lines so you can check if anything seems out of place. You should now be able to read it in as follows.

vcf <- read.vcfR('myData.vcf.gz')
Scanning file to determine attributes.
File attributes:
  meta lines: 18
  header_line: 19
  variant count: 5
  column count: 12
Meta line 18 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 5
  Character matrix gt cols: 12
  skip: 0
  nrows: 5
  row_num: 0
Processed variant: 5
All variants processed
vcf
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****

You do not report any of the output that read.vcfR() reports, which suggests to me something is wrong with your file. If you can't share the file, or part of it, could you report the results of the scan command above?

Thanks! Brian

Tman3 commented 6 years ago

Hi Brian: Thank for your speed reply. I have tried your scan command on the aba.vcf.gz. I was unable to scan the file and returned the first 10 lines. I tried it with 100 lines, and it still worked. As the file is large, I will send you a link by e-mail. Thanks! Tman3

Tman3 commented 6 years ago

Hi Brian: Here is the link to the file. It is called 08study.vcf.gz.

https://1drv.ms/u/s!AlBJuv7C8eF7hb41rd1VKo8An3M7nA

Thanks! Tman3

On Thu, Mar 29, 2018 at 11:44 PM, Brian Knaus notifications@github.com wrote:

Hi @Tman3 https://github.com/Tman3,

This sounds like there is an issue with your VCF file. However, without a minimal reproducible example https://knausb.github.io/vcfR_documentation/reporting_issue.html that would help me reproduce your error on my machine I'm very limited in what I can do. Can you post your VCF file to the issue, or e-mail me directly. If your file is large can you post a portion of it that reproduces your error?

The pinfsc50 dataset is just a convenient way to distribute data. But you should be able to use your data very similarly. There is an example included with vcfR that is even smaller, which is nice because you can see it on your screen. You should be able to create an example VCF file in your working directory as follows.

library(vcfR) data("vcfR_test") vcfR_test Object of Class vcfR 3 samples 1 CHROMs 5 variants Object size: 0 Mb 0 percent missing data


write.vcf(x=vcfR_test, file = 'myData.vcf.gz')

You can take a peak at the first 10 lines as follows.

scan("myData.vcf.gz", what='character', nmax=10, sep = '\n')

This should print the first 10 lines so you can check if anything seems out of place. You should now be able to read it in as follows.

vcf <- read.vcfR('myData.vcf.gz') Scanning file to determine attributes. File attributes: meta lines: 18 header_line: 19 variant count: 5 column count: 12 Meta line 18 read in. All meta lines processed. gt matrix initialized. Character matrix gt created. Character matrix gt rows: 5 Character matrix gt cols: 12 skip: 0 nrows: 5 row_num: 0 Processed variant: 5 All variants processed vcf Object of Class vcfR 3 samples 1 CHROMs 5 variants Object size: 0 Mb 0 percent missing data


You do not report any of the output that read.vcfR() reports, which suggests to me something is wrong with your file. If you can't share the file, or part of it, could you report the results of the scan command above?

Thanks! Brian

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/100#issuecomment-377278625, or mute the thread https://github.com/notifications/unsubscribe-auth/AV7TkK2TxVq0EoTOMXN3yHR2DzP0d8wQks5tjQFugaJpZM4TAW7- .

knausb commented 6 years ago

Hi Tman3,

You state:

I was unable to scan the file and returned the first 10 lines.

Then you state:

I tried it with 100 lines, and it still worked.

This appears to be a contradiction, so I do not know how to interpret it.

Thanks for providing your file as an example! I was successful at using scan() to get the first 10 lines and the first 100 lines. Note that this will not be formatted very pretty, I'm just trying to see if you can read the file contents into R through an alternate path.

When I try to open your file, here's what I see.

> library(vcfR)

   *****       ***   vcfR   ***       *****
   This is vcfR 1.7.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

> vcf <- read.vcfR('08study.vcf.gz')
Scanning file to determine attributes.
File attributes:
  meta lines: 44
  header line: 45
  variant count: 1165676
  column count: 30
Meta line 44 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
  Character matrix gt rows: 1165676
  Character matrix gt cols: 30
  skip: 0
  nrows: 1165676
  row_num: 0
Processed variant: 1165676
All variants processed
> vcf
***** Object of Class vcfR *****
21 samples
14 CHROMs
1,165,676 variants
Object size: 587 Mb
0 percent missing data
*****        *****         *****

I appear to be successful at reading in your file. Note that I am using Ubuntu 16.04 and you report using Windows 7. Because I am able to open it and you are not, this may mean that the error is specific to your system. With a little googling I found this SO post. It suggests that there is an issue with the R code. Yet read.vcfR() is fairly well tested at this point (over 23k downloads). Is it possible that there is other code you are using that you did not previously share? Perhaps try copying and pasting my example above and see if you have any success. If not, perhaps your installation of vcfR has become corrupted somehow? You could try removing it and reinstalling it.

remove.packages('vcfR')
install.packages('vcfR')

Again, because I can't reproduce this error on my machine it is very difficult for me to troubleshoot. Please give my suggestions a try and let me know how they turn out.

Thanks! Brian

knausb commented 6 years ago

This page also seems to indicate the issue may have to do with syntax. Is it possible that you're missing a comma somewhere? Your original post seemed well formed. Note that you do not need to include default parameters in the call. You only need these if you want to deviate from the default. If you omit them and use the defaults it makes for a simpler command. Simplicity typically means less opportunity for error.

Good luck! Brian

knausb commented 6 years ago

When I try:

> vcf <- read.vcfR('08study.vcf.gz' limit = 1e+07)
Error: unexpected symbol in "vcf <- read.vcfR('08study.vcf.gz' limit"

Which appears very similar to the originally posted error.

Tman3 commented 6 years ago

I finally made vcfR read in the file under Windows 7 in RStudio. What I did is switching back to R version 3.4.3 instead of 3.4.4. Thanks for your assistance.

knausb commented 6 years ago

I just wanted to follow up on this by stating that I do not consider this to be an answer in that we have not identified a problem and addressed it. If you're up and running then that's fine, sometimes that's all we're interested in. But if someone else reads this I wanted to state that I do not feel this is an issue with R 3.4.4. I was able to read the file in with R 3.4.4 and the below session info.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfR_1.7.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16      lattice_0.20-35   ape_5.0           viridisLite_0.2.0 permute_0.9-4     MASS_7.3-49      
 [7] grid_3.4.4        nlme_3.1-131.1    magrittr_1.5      vegan_2.4-4       Matrix_1.2-11     pinfsc50_1.1.0   
[13] tools_3.4.4       yaml_2.1.18       parallel_3.4.4    compiler_3.4.4    cluster_2.0.6     mgcv_1.8-23   

I suspect the issue is that the user's R installation, or perhaps a library, became corrupted and that reverting the R version installed a new copy of whatever broke. I would guess that if the user upgraded to 3.4.4 now it would be fixed.

Thanks! Brian