knausb / vcfR

Tools to work with variant call format files

Error in extract.gt(vcf) - ID column contains non-unique names #172

Closed gimeno99 closed 4 years ago

gimeno99 commented 4 years ago

Hi, I am trying to extract the GT from a VCF file for a few specific populations, and every time I get this error: Error in extract.gt(vcf10pop1, element = "GT", as.numeric = TRUE, IDtoRowNames = TRUE) : ID column contains non-unique names. How do we make the ID column unique? Is there any easy R code to assign unique names to the IDs so that I won't get this error? I am not an expert in R and have only been working with VCF files since this summer.

command was:
vcf1 <- extract.gt(vcf, element = 'GT', as.numeric = TRUE, IDtoRowNames = TRUE)

error:
Error in extract.gt(vcf10pop1, element = "GT", as.numeric = TRUE, IDtoRowNames = TRUE) : ID column contains non-unique names

data: from the 1000 Genomes files, for 2 populations from each super population, covering the SNPs between positions 17500001 and 20000000

knausb commented 4 years ago

This appears highly redundant to #170. The VCFv4.3 Specification in section 1.6.1 states that the ID should contain unique identifiers. The error you're reporting is trying to tell you that there are issues with your VCF file. This does not appear to have anything to do with vcfR. I've previously (#170) shown you how to identify these issues. I feel the real concern is "why do you have non-unique names?" Is it because of the processing steps you mentioned in #170?

We've invested a lot of time and effort providing documentation for new users such as yourself.

https://knausb.github.io/vcfR_documentation/
http://grunwaldlab.github.io/Population_Genetics_in_R/index.html
https://knausb.github.io/vcfR_documentation/reporting_issue.html

Please take the time to work through these documents. They appear to address many of your issues.

gimeno99 commented 4 years ago

Hi @knausb, I understand that this is similar to #170. My question here is how to fix the problem once we see the non-unique names error. I do need to make the row names unique, and I was struggling to write R code that makes them unique by appending the chrom ID and position. The suggestion was to add a suffix to the duplicates, but I am a little unfamiliar with finding and replacing the SNP values. The query sort(table(getID(VCFm)), decreasing = TRUE)[1:10] returned the duplicates below, but I am unable to proceed further and append _1, _2 or _3 to them. So I believe I need to append _1, _2 to the first 4 SNPs.

rs141796829  rs11471553 rs202131091  rs71329353  esv3644940  esv3644941  esv3644942  esv3644943  esv3644944
          3           2           2           2           1           1           1           1           1
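
One possible way to append _1, _2, ... to repeated IDs is base R's make.unique(). The sketch below is only an illustration under assumptions: the `ids` vector is made-up data standing in for the output of getID(), and the commented-out last line assumes a vcfR object named `vcf` whose ID column is stored in the `fix` slot. It also assumes there are no missing IDs; those would need separate handling. Note that renaming only silences the error and does not explain why the VCF contains repeated IDs in the first place.

```r
# Hypothetical IDs standing in for getID(vcf); real values come from the VCF file.
ids <- c("rs141796829", "rs11471553", "rs141796829",
         "esv3644940", "rs11471553", "rs141796829")

# The first occurrence keeps its name; later repeats get _1, _2, ...
unique_ids <- make.unique(ids, sep = "_")
print(unique_ids)

# Confirm that no duplicates remain.
stopifnot(anyDuplicated(unique_ids) == 0)

# Assumption: with a vcfR object named `vcf`, the same call could be applied
# to its ID column before calling extract.gt():
# vcf@fix[, "ID"] <- make.unique(vcf@fix[, "ID"], sep = "_")
```

After this, the counts from table() on the new IDs should all be 1, and extract.gt() with IDtoRowNames = TRUE should no longer complain about non-unique names.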