knausb / vcfR

Tools to work with variant call format files
246 stars 55 forks source link

vcfR2tidy can't handle files with empty INFO #200

Open alkaZeltser opened 2 years ago

alkaZeltser commented 2 years ago

Firstly, thank you for a wonderful package! I, unfortunately, have discovered a bug with an interesting edge-case. I have a VCF that contains no information in the INFO column (. in each row) and accordingly no lines in the header with information on any INFO field (no ##INFO=<> lines).

This becomes a problem when calling the vcfR2tidy function. It generates the following error: Error in strsplit(unlist(x), split = "=") : non-character argument

This error is resolved when a dummy INFO line is added to the @meta object.

I have traced the source to the following function which parses the INFO and FORMAT header information, assuming that a line starting with ##INFO exists: https://github.com/knausb/vcfR/blob/9ef985bc908778972d66d02dfe0115da941212b5/R/vcfR_to_tidy_functions.R#L632

Reproducible example:

data(vcfR_test)
## cause error ##
# remove all INFO lines in @meta object
INFO.meta.lines <- grepl("^##INFO", vcfR_test@meta);
vcfR_test@meta <- vcfR_test@meta[!INFO.meta.lines];
# remove all data from INFO column of @fix object
INFO.col.index <- 8;
vcfR_test@fix[, INFO.col.index] <- rep(NA, nrow(vcfR_test@fix));

vcfR2tidy(vcfR_test)

Error in strsplit(unlist(x), split = "=") : non-character argument

## resolve error ##
dummy.INFO.line <- "##INFO=<ID=AF>"
vcfR_test@meta[length(vcfR_test@meta) + 1] <- dummy.INFO.line

vcfR2tidy(vcfR_test)

Extracting gt element GT
Extracting gt element GQ
Extracting gt element DP
Extracting gt element HQ
knausb commented 2 years ago

Hi @alkaZeltser , thanks for posting this and thank you for creating a minimal reproducible example. That greatly facilitates this conversation! I have reproduced your example and agree that the behavior you report is accurate. I have some questions that I hope you can help to address.

You report VCF data where the INFO column is populated by NA (in R we use NA, the VCF specification uses '.'). Because this column has no data it is not documented in the meta region. If there were any data in the 'INFO' column I do feel that it should be described in the meta region and if it were not I would interpret this as deviating from the VCF specification, and I would not feel obligated to address it. But you're reporting no records in the meta AND missing data in the INFO column. Because I have never encountered your situation I took a look at the VCF specification to make sure you're reporting a reasonable situation. And I do believe it's reasonable. So I'm curious, how was this file created?

The issue appears to be that we have not anticipated a situation where there are zero INFO records in the meta region. Thank you for proposing a solution! If we move forward on this it seems to me we would be better off engineering a solution where it is recognized that there are no INFO records in the meta region as opposed to adding dummy data. I think I can handle this. But I thought I should address your proposal and see if you agree.

Thanks!

alkaZeltser commented 2 years ago

Hi Brian, thank you for your reply!

So I'm curious, how was this file created?

It is indeed an unusual file from my experience; very minimalist. As you say, the INFO column is populated by '.' in the original VCF. This VCF file was created by a genotyping core with data from a genotyping experiment using the Global Screening Array on an Illumina machine (I don't know the exact one, unfortunately). The header records VCF format version 4.1 and processing via gtc_to_vcf 1.2.1. The core uses some of their own quality control procedures in addition to the Illumina defaults to filter certain sites and samples, and that's pretty much it. The FORMAT column is also very minimal, containing just two fields: GT:GQ The QUAL column is also completely empty (just '.').

If we move forward on this it seems to me we would be better off engineering a solution where it is recognized that there are no INFO records in the meta region as opposed to adding dummy data.

I completely agree, my solution was simply a quick hack/sanity check. Accounting for a case with an empty INFO column and no INFO records in the VCF header would be more appropriate.

knausb commented 2 years ago

Hi @alkaZeltser , I'm not familiar with gtc_to_vcf. Thanks for letting me know.

I think I have a solution for your issue. It's in the master branch now. If you'd like to give it a try, the following should install it.

devtools::install_github(repo="knausb/vcfR")

Note that you will need a compiler to install from GitHub. If you do try it and notice anything, please let me know and we can try again. If you're interested in changes I made, you should be able to see the diff below.

https://github.com/knausb/vcfR/commit/eb7124a59c41219c67f75b379c0b39d6f4758218

Thanks and please let me know if you do not feel this addresses your issue! Brian

alkaZeltser commented 2 years ago

Thank you for responding so quickly! I will try to test out the new code. For reference, I found that GTCtoVCF is an Illumina tool with documentation here: https://github.com/Illumina/GTCtoVCF