Add information for SNP filtering to methods

yaaminiv commented 8 months ago

Comment from Katherine on genetic variation methods: Did you filter the SNPs by missing data or minor allele frequency prior to comparisons? If so, provide those values here, even if they are a default in the pipeline.

@kubu4, Were SNPs filtered before clustering with EpiDiverse/snp?

@sr320, were SNPs filtered before correlating genetic distance with methylation/gene expression distance?

sr320 commented 8 months ago

This seems to be the file used for the SNP matrix: https://github.com/sr320/ceabigr/blob/main/output/53-revisit-epi-SNPs/epiMATRIX_mbd_rab.txt

see https://rpubs.com/sr320/1168363

sr320 commented 8 months ago

Which came from

/home/shared/ngsRelate/ngsRelate/ngsRelate \
-h ../output/51-SNPs/EpiDiv_merged.f.recode.vcf \
-T GT \
-c 1 \
-z ../output/53-revisit-epi-SNPs/sample.txt \
-O ../output/53-revisit-epi-SNPs/vcf.relatedness

sr320 commented 8 months ago

which came from

/home/shared/vcftools-0.1.16/bin/vcftools \
--vcf ../output/51-SNPs/EpiDiv_merged.vcf \
--recode --recode-INFO-all \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
--mac 2 \
--out ../output/51-SNPs/EpiDiv_merged.f.recode.vcf

sr320 commented 8 months ago

so to

were SNPs filtered before correlating genetic distance with methylation/gene expression distance?

yes,

--min-alleles 2 --max-alleles 2: Keeps only bi-allelic sites, meaning sites with exactly two alleles (typically the reference allele and one variant allele).

--max-missing 0.5: Filters sites by missing data. Despite the name, the value is the minimum proportion of samples that must have a called genotype (0 allows sites that are completely missing, 1 allows no missing data). At 0.5, any site where more than 50% of the genotypes are missing is discarded.

--mac 2: Keeps only sites with a Minor Allele Count of at least 2, meaning the less common allele must appear at least twice across the samples.
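To make the three filters concrete, here is a minimal Python sketch of the per-site logic as described above. It is only an illustration, not how vcftools is implemented: the function name `passes_filters`, its parameters, and the genotype encoding (a pair of 0/1 allele codes, or `None` for a missing call) are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding for this sketch: a diploid genotype is a pair of
# allele codes (0 = reference, 1 = alternate); None stands for a missing call.
Genotype = Optional[Tuple[int, int]]


def passes_filters(
    n_alleles: int,
    genotypes: List[Genotype],
    min_called: float = 0.5,  # mirrors --max-missing 0.5
    min_mac: int = 2,         # mirrors --mac 2
) -> bool:
    """Return True if a site would survive the three filters described above."""
    # --min-alleles 2 --max-alleles 2: keep bi-allelic sites only.
    if n_alleles != 2:
        return False

    # --max-missing 0.5: at least half of the samples need a called genotype.
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_called:
        return False

    # --mac 2: the rarer allele must be observed at least twice.
    alt_count = sum(allele for g in called for allele in g)
    ref_count = 2 * len(called) - alt_count
    return min(ref_count, alt_count) >= min_mac


# Four samples: two heterozygotes, one homozygous reference, one missing.
site = [(0, 1), (0, 1), (0, 0), None]
print(passes_filters(2, site))
```

In this toy site, 3 of 4 genotypes are called and the alternate allele is seen twice, so the site is kept. A site with a single heterozygote would fail the `--mac 2` step.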

sr320 commented 8 months ago

@yaaminiv can you give more context for

Were SNPs filtered before clustering with EpiDiverse/snp

yaaminiv commented 8 months ago

@yaaminiv can you give more context for

Were SNPs filtered before clustering with EpiDiverse/snp

Poor wording on my part. I meant: was there any sort of internal filtering within EpiDiverse/snp when the clustering was done?

sr320 commented 8 months ago

for reference: here is code in question - https://robertslab.github.io/sams-notebook/posts/2022/2022-09-21-BSseq-SNP-Analysis---Nextflow-EpiDiverse-SNP-Pipeline-for-C.virginica-CEABIGR-BSseq-data/index.html

sr320 commented 8 months ago

My interpretation is that EpiDiverse uses Freebayes:

Variant calling

Variant calling is performed with Freebayes, on whole genome bisulfite sequencing data which has been masked in bisulfite contexts and can be thus interpreted as normal sequencing data. Statistics are estimated with bcftools stats and plotted with plot_vcfstats.

Output directory: snps/vcf/

*.vcf.gz
The full results from Freebayes (parallel), run using the following options:
--no-partial-observations
--report-genotype-likelihood-max
--genotype-qualities
--min-repeat-entropy <ARG>
--min-coverage <ARG>

Let's break down the specified options:

--no-partial-observations: Tells Freebayes to exclude observations (reads) that do not fully span the haplotype window it is evaluating. By default, partial observations are used and their support is divided across matching haplotypes. Excluding them means calls rely only on reads that cover the whole window.

--report-genotype-likelihood-max: Freebayes reports genotypes using the maximum-likelihood estimate from the genotype likelihoods, i.e. the genotype that is most probable given the sequencing data at each site.

--genotype-qualities: Freebayes calculates the quality of each called genotype and reports it as GQ in the sample fields of the VCF. A higher GQ indicates higher confidence in the call, which is useful for filtering and downstream analyses.

--min-repeat-entropy <ARG>: Controls how Freebayes handles interrupted repeats: the haplotype window is extended until the sequence reaches more than <ARG> bits of entropy per bp. Low-entropy (low-complexity) sequence is more prone to sequencing and alignment errors, so this setting affects how variants in repetitive regions are called.

--min-coverage <ARG>: The minimum read coverage required to consider a site for variant calling. Read coverage is the number of reads covering a position. Higher coverage provides more evidence for a call; sites below the threshold are not processed.

So in theory? @kubu4 could set in some param file?

yaaminiv commented 8 months ago

So in theory? @kubu4 could set in some param file?

Based on the script it doesn't seem like any parameters were set...so maybe no filtering at that step?

sr320 commented 8 months ago

I feel like if they are not set, the default settings would have to be applied.

kubu4 commented 8 months ago

EpiDiverse/snp has the following defaults when running Freebayes:

    entropy = 1
    coverage = 0
    regions = 100000
    ploidy = 2

https://github.com/EpiDiverse/snp/blob/64a73485dc966eabc5cfe79f5d3713f3c0db8a84/nextflow.config#L26-L30

So:

--min-coverage is: 0
--min-repeat-entropy is: 1
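As a rough illustration of how those defaults translate into the two Freebayes options above, here is a small Python sketch. It is not the pipeline's code: `EPIDIVERSE_DEFAULTS`, `PARAM_TO_FLAG`, and `freebayes_filter_args` are hypothetical names, and only the two parameter-to-flag mappings stated in this thread are included.

```python
from typing import Dict, List, Optional

# Defaults as quoted above from the EpiDiverse/snp nextflow.config.
EPIDIVERSE_DEFAULTS: Dict[str, int] = {
    "entropy": 1,
    "coverage": 0,
    "regions": 100000,
    "ploidy": 2,
}

# Only the two mappings spelled out in this thread; how the pipeline uses
# the other parameters is not assumed here.
PARAM_TO_FLAG: Dict[str, str] = {
    "entropy": "--min-repeat-entropy",
    "coverage": "--min-coverage",
}


def freebayes_filter_args(overrides: Optional[Dict[str, int]] = None) -> List[str]:
    """Build the two filter-related arguments, falling back to the defaults."""
    params = {**EPIDIVERSE_DEFAULTS, **(overrides or {})}
    args: List[str] = []
    for name, flag in PARAM_TO_FLAG.items():
        args += [flag, str(params[name])]
    return args


# No overrides: this is the situation when nothing is set in a param file.
print(" ".join(freebayes_filter_args()))
```

With no overrides, the arguments come out as `--min-repeat-entropy 1 --min-coverage 0`, i.e. no minimum coverage is required at the variant-calling step.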

sr320 / ceabigr

Add information for SNP filtering to methods #109