ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

VCF from fast file? #64

Closed olgakozhar closed 1 year ago

olgakozhar commented 2 years ago

Describe the problem you are having

Hi there, I am working with a set of genes from two populations, alignments of which are in fasta format. This fasta alignment was generated by the pipeline of various filtering steps using several programs. I have read the suggestions on the Pixy website on the way to generate VCF with invariant sites, but both of those suggestions include working with bam files. Does anyone by chance know if there is a way to covert fasta file into VCF with invariant sites (store all sites, not just snps from fasta in vcf)?

I came across snp-sites tool that extracts snps from fasta files, it has an argument to keep monomorphic sites, but when I ran it it does not actually store monomorphic sites in the VCF.

Thank you Olga

ksamuk commented 2 years ago

Hi Olga,

I'm not aware of a tool that converts FASTA to VCF and preserves monomorphic sites. This isn't fully tested, but if you'd like to work in R, the below is a re-implementation of the pixy algorithm for pi (only). It only works for single populations.

library("ape")
library("tidyverse")

pixy_pi <- function(data){

  # convert the ape DNA object to a data frame
  dnas <- as.data.frame(as.character(data))

  # create table of alleles from FASTA file
  allele_count_tab <- apply(dnas, 1, table)

  if(is.numeric(allele_count_tab)){

    allele_count_tab <- data.frame(a = allele_count_tab)

  } else{

    allele_count_tab <- allele_count_tab %>% 
      bind_rows() %>% 
      mutate_if(is.table, as.numeric)

  }

  # remove n counts (not used)
  if ("n" %in% names(allele_count_tab)){

    allele_count_tab <- allele_count_tab %>% 
      select(-n)

  }

  if ("N" %in% names(allele_count_tab)){

    allele_count_tab <- allele_count_tab %>% 
      select(-N)

  }

  # convert alleles to major + minor allele counts

  # initialize a new dataframe 
  maj_min_tab <- data.frame(maj = rep(NA, nrow(allele_count_tab)),
                            min = rep(NA, nrow(allele_count_tab)))

  # assign major/minor allele 
  for (i in 1:nrow(allele_count_tab)){

    alleles <- allele_count_tab[i,]

    # if there are more than two alleles at the size, drop it
    if (sum(alleles != 0, na.rm = TRUE) > 2){

      maj_min_tab[i,] <- data.frame(maj = NA, min = NA)

    } else{

      alleles_sorted <- alleles %>% as.numeric %>% sort(decreasing = TRUE)

      if (length(alleles_sorted) == 1){

        maj_min_tab[i,] <- data.frame(maj = alleles_sorted, min = 0)

      } else{

        maj_min_tab[i,] <- data.frame(maj = alleles_sorted[1], min = alleles_sorted[2])

      }

    }

  }

  # calculate pi from maj/minor alleles
  pi_calc_df <- maj_min_tab %>%
    mutate(n_diffs = maj * min) %>%
    mutate(n_gts = maj + min) %>%
    mutate(n_comps = choose(n_gts,2))

  pixy_pi <- sum(pi_calc_df$n_diffs, na.rm = TRUE)/sum(pi_calc_df$n_comps, na.rm = TRUE)

  return(data.frame(pixy_pi))

}

# example usage
# only works for a single population
my_data_pop1 <- read.FASTA("my_data_pop1.fasta")
pixy_pi(my_data_pop1)
olgakozhar commented 2 years ago

Hi Kieran,

thank you for sharing the code. I am giving it a try, it has been running over the weekend and still running. Will keep everyone posted how it went.

Olga