knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

vcfR: ID column contains non-unique names #170

Closed gimeno99 closed 3 years ago

gimeno99 commented 3 years ago

@knausb

Hi Brian, I am trying to convert the vcf file into double format through R and getting this error. I have split the files using Data slicer and merged the files together using TASSEL GUI. when I load the ind vcf files, i am able to use extract.gt(x) command on them and able to convert to double. while creating split vcf files, i made sure that, the positions are chosen not repetitively and hence not sure what the issue is. `VCFm <- vcfR::read.vcfR("~/x", verbose = FALSE) VCFm VCFm@gt ## to show the content of the S4 vcf file mtx_vcfM<- extract.gt(VCFm, element = 'DP', as.numeric = TRUE) ## parse the file --> this is where I am getting the error as - Error in extract.gt(x, element = "DP", as.numeric = TRUE) : ID column contains non-unique names

The merged file is here for analysis.- https://app.mediafire.com/myfiles#myfiles

Note: i went through mot of the answers for this error and none of them seemed to solve the issue. It might be ID duplicate but how do we verify that.

knausb commented 3 years ago

Hi @gimenomala ,

The ID column is supposed to contain a unique identifier for each variant. Please validate this by consulting the specification. The error you've posted seems to indicate that you have non-unique names. You should be able to validate this as follows.

sort(table(getID(VCFm)), decreasing = TRUE)[1:10]

If this validates your problem then we can move on to a solution. A possible "quick fix" would be to add a subscript to the ID values (e.g., _1, _2, _3, etc.). But what I really think you need to do is go back through your analytic steps and figure out where this issues was introduced because it sounds like there was an error somewhere in there.

Please let me know if this helps! Brian

gimeno99 commented 3 years ago

Hi @knausb , I am unable to troubleshoot and find whats wrong with ID column and while verifying using the sort command, I dont see any issues. I am not that expert in R and hence unable to subscript and add add or append the ID section with 1,2,3 as you have suggested. so the issue is still there. But now, I found a strange issue now while reading the vcf file in R through vcfR. I am getting truncated value of FORMAT section, I mean only GT values are getting displayed and rest are getting truncated. I created the files from the whole genome set to a smaller one using data slicer tool, and directly unzipped using gunzip in R and reading with vcfR.

head(vcf175_1) [1] " Object of class 'vcfR' " [1] " Meta section " [1] "##fileformat=VCFv4.1" [1] "##FILTER=<ID=PASS,Description=\"All filters passed\">" [1] "##fileDate=20150218" [1] "##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/refe [Truncated]" [1] "##source=1000GenomesPhase3Pipeline" [1] "##contig=" [1] "First 6 rows." [1] [1] " Fixed section " CHROM POS ID REF ALT QUAL FILTER [1,] "22" "17420222" "esv3647204" "A" "" "100" "PASS" [2,] "22" "17500036" "rs17204993" "T" "C" "100" "PASS" [3,] "22" "17500062" "rs558880693" "G" "A" "100" "PASS" [4,] "22" "17500098" "rs577294956" "C" "T" "100" "PASS" [5,] "22" "17500120" "rs544700133" "A" "G" "100" "PASS" [6,] "22" "17500154" "rs562884337" "T" "C" "100" "PASS" [1] [1] " Genotype section " FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 [1,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[2,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[3,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[4,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[5,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[6,] "GT" "0|0" "0|0" "0|0" "0|0" "0|0"
[1] "First 6 columns only." [1] [1] "Unique GT formats:" [1] "GT"

Do you know what could be the issue? Is it because of this that ID column is also giving error as non-unique?

gimeno99 commented 3 years ago

To add on, while using queryMETA(x) also, I am only seeing the below as output, and nothing from FORMAT section.

queryMETA(vcf175_1, element = 'DP') ## fixed meta data display, specific to filter=DP [[1]] [1] "INFO=ID=DP"
[2] "Number=1"
[3] "Type=Integer"
[4] "Description=Total read depth; only low coverage data were counted towards the DP"

while using this command, I get as below, which is strange.. :(

queryMETA(vcf175, element = 'FORMAT=<ID=DP') list()

knausb commented 3 years ago

Hi @gimenomala ,

The function vcfR::read.vcfR() reads gzipped files, so there should be no need to gunzip your files.

I was not able to access the file you posted so I can not comment on it.

From the information you posted it appears you are working with data from the 1000 genomes project. This indicates to me that you started with valid files and it is one of your processing steps that is introducing issues. I think you need to review your work to make sure you understand what you're doing and whether there are any other issues you are introducing.

You've asserted that your data was "truncated" but I see no evidence of this. In the past when I've worked with 1000 genomes data they only had genotypes (GT). Your format column only includes "GT" which reinforces my perspective. Is there any reason you would expect anything besides genotypes? Remember, a VCF file is just a text file, so you can open it with less, more, vim, etc.

Your "DP" data is in the "INFO" column and not in the "FORMAT" column which indicates that there is no reason to expect "DP" to be associated with any of the genotypes.

Good luck! Brian

gimeno99 commented 3 years ago

hi Brian, @knausb what I meant by saying - that GT does not show relevant data is - while using extract.gt(x, element='DP') I am expecting the Depth values also to be displayed in the extracted matrix whereas in this case, I only get to see GT in first column, and hence unable to get numerical values in the final matrix. from your paper, I see that extract,gt() command on vcf_test file resulted in results where the va,ues are available for GT:GQ:DP etc where the file which I read only displayes GT value. adding the doc file for more info. vcf-170.docx

adding 2 vcf.gz files which i am analysing..

22.51050075-53550074.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 22.36050075-38550074.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

knausb commented 3 years ago

Hi @gimenomala

I looked at

22.51050075-53550074.ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

and, as I suspected and stated previously, it only includes GT information. You are asking me why my software does not return information that does not exist. Please see the documentation I've already posted.

https://knausb.github.io/vcfR_documentation/vcf_data.html

You may also want to consult the VCF specification.

Good luck! Brian

gimeno99 commented 3 years ago

Thank you, I will try to get a sample where i has DP values, thanks much