lvclark / polyRAD

Genotype Calling with Uncertainty from Sequencing Data in Polyploids 🍌🍓🥔🍠🥝
24 stars 8 forks source link

number of loci #16

Closed cculma closed 3 years ago

cculma commented 3 years ago

Hi, I am using polyRAD to generate allele dosage in alfalfa. I used NGSEP (https://sourceforge.net/p/ngsep/wiki/Home/) to generate the VCF. However the number of loci generated by VCF2RADdata is lower compared with the original VCF. Do you know the way to obtain the same number of sites? Thank you.

myRAD7 <- VCF2RADdata("logan_8.vcf.bgz",
                      refgenome = "MtrunA17r5.0-20161119-ANR.fasta",
                      al.depth.field = "ACN",
                      possiblePloidies = list(4))
> myRAD7
## RADdata object ##
272 taxa and 3807 loci
3811212 total reads
Assumed sample cross-contamination rate of 0.001

Possible ploidies:
Autotetraploid (4)
logan8 <- readVcf("logan_8.vcf")
> logan8
class: CollapsedVCF 
dim: 6862 272 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 12 columns: CNV, TA, TID, TGN, TCO, TACH, NS, MAF, AN, AFS, TYPE, FS
info(header(vcf)):
        Number Type    Description                                                                                                 
   CNV  1      Integer Number of samples with CNVs around this variant                                                             
   TA   1      String  Variant annotation based on a gene model                                                                    
   TID  1      String  Id of the transcript related to the variant annotation                                                      
   TGN  1      String  Name of the gene related to the variant annotation                                                          
   TCO  1      Float   One based codon position of the start of the variant. The decimal is the codon position                     
   TACH 1      String  Description of the aminoacid change produced by a non-synonymous mutation. String encoded as reference am...
   NS   1      Integer Number of samples genotyped                                                                                 
   MAF  1      Float   Minor allele frequency                                                                                      
   AN   1      Integer Number of alleles in called genotypes                                                                       
   AFS  R      Integer Allele counts over the population for all alleles, including the reference                                  
   TYPE 1      String  Type of variant                                                                                             
   FS   1      Float   Phred-scaled p-value using Fisher's exact test to detect strand bias                                        
geno(vcf):
  List of length 7: GT, PL, GQ, DP, ADP, BSDP, ACN
geno(header(vcf)):
        Number Type    Description                                                                                                 
   GT   1      String  Genotype                                                                                                    
   PL   G      Integer Phred-scaled genotype likelihoods rounded to the closest integer                                            
   GQ   1      Integer Genotype quality                                                                                            
   DP   1      Integer Read depth                                                                                                  
   ADP  R      Integer Counts for observed alleles, including the reference allele                                                 
   BSDP 4      Integer Number of base calls (depth) for the 4 nucleotides in called SNVs sorted as A,C,G,T                         
   ACN  R      Integer Predicted copy number of each allele taking into account the prediction of number of copies of the region...

Best,

Cesar

lvclark commented 3 years ago

Phasing and/or filtering would be the reason you get a different number of markers. If you need the RADdata object to match your VCF exactly, add the options:

phaseSNPs = FALSE, min.ind.with.reads = 0, min.ind.with.minor.allele = 0

It also looks like you should do al.depth.field = "ADP", not ACN.

cculma commented 3 years ago

Thank you Lindsay! It works fine.

Best,

Cesar