lvclark / polyRAD

Genotype Calling with Uncertainty from Sequencing Data in Polyploids 🍌🍓🥔🍠🥝
24 stars 8 forks source link

Error in RADdata2VCF #38

Closed jmgorospe closed 1 year ago

jmgorospe commented 1 year ago

Hello Lindsay,

Thanks for the polyRAD package. I am trying to use it to estimate polyploid genotypes for my supposedly polyploid RADseq data (which are samples from populations in the wild), for which there is not a reference genome available. My goal is to use the estimated polyploid genotypes for further analysis starting from a VCF file.

I imported data from STACKS using readStacks with the 'catalog', 'matches files' and the 'sumstatsFile'. I was able to follow the tutorial "Estimating genotype probabilities in a diversity panel" but I have an issue when trying to export the RADdata object to VCF format using the RADdata2VCF function. I get this error message:

Error in RADdata2VCF(myRADdata, file = "file.vcf",  : 
     Complete haplotype information not provided; unable to determine SNP positions.  Use refgenome argument in VCF2RADdata.

It seems I need the 'refgenome argument', but to my understanding I am lacking this. Do have any idea how can I come around this?

Thanks

lvclark commented 1 year ago

You can hack this with

attr(myRADdata$alleleNucleotides, "Variable_sites_only") <- FALSE

However, please be aware that variant positions in the output VCF may be incorrect, and it should not be used for functional annotation (e.g. determining protein coding consequences).

jmgorospe commented 1 year ago

Thanks for the reply. I tried your suggestion and I got an error message that I do not understand and I would appreciate some help. Here is the code:

> attr(myRADdata$alleleNucleotides, "Variable_sites_only") <- FALSE
> RADdata2VCF(myRADdata file = "myRADdata.vcf", asSNPs = TRUE, hindhe = TRUE)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
     In range 21: at least two out of 'start', 'end', and 'width', must be supplied.
In addition: Warning message:
In RADdata2VCF(myRADdata, file = "myRADdata.vcf",  :
     Reference allele not indicated.  Using the major allele for each locus.

Thanks for the warning about functional annotation, I plan to use the VCF for some demographic analyses and compare with results with the diploidized dataset

lvclark commented 1 year ago

I'm not sure it will solve the issue, but here is something to try. My best guess is it is a problem with how alleleNucleotides is formatted.

an2 <- gsub("[\\.\\-]", "", myRADdata$alleleNucleotides") # Remove indel characters
blank_alleles <- which(nchar(an2) == 0) # Alleles with nothing but indel characters
blank_loci <- unique(myRADdata$alleles2loc[blank_alleles])
ok_loci <- setdiff(myRADdata$alleles2loc, blank_loci) # Loci that don't have any "blank" alleles
myRADdata2 <- SubsetByLocus(myRADdata, ok_loci) # Subset object
jmgorospe commented 1 year ago

Hi Lindsay,

Thanks for the reply.

I tried your suggestion but, unfortunately, I still got the same error and warning.

I have attached a subset of my RADdata object in case it would help to replicate the issue. myRADdata_subset.zip

lvclark commented 1 year ago

Thanks for your patience. I was able to eliminate the error with:

myRADdata_subset2 <- SubsetByLocus(myRADdata_subset, which(!is.na(myRADdata_subset$locTable$Chr)))

RADdata2VCF(myRADdata_subset2, file = "myRADdata.vcf", asSNPs = TRUE, hindhe = TRUE)

That warning should not be a problem if you are just using the VCF for population genetics.

jmgorospe commented 1 year ago

Hi Lindsay,

Thanks for the reply. I tried the code you suggested and I succeed to generate the VCF file.

Thank you very much for your help with the issue!