Multiallelic positions handling not available in v0.5.0

vilperte commented 4 months ago

Hi @lczech

I am using grenedalf v0.5.0 to calculate FST between a set of samples. My set of samples are coming from 3 different vcf files, so I am using the --sample-group-merge-table-file to create the different groups (it works pretty well).

Moreover, my window-type is a gff3 file containing genes only.

Here is my command:

grenedalf fst --vcf-path set1.bcf --vcf-path set2.bcf --vcf-path set3.bcf \
    --reference-genome-fasta-file genome.fasta \
    --sample-group-merge-table-file grouping.list \
    --filter-total-only-biallelic-snps \
    --window-region-gff gff_genes.gff3 \
    --window-type regions \
    --pool-sizes pools.list \
    --filter-total-snp-min-count 2 \
    --method unbiased-hudson \
    --window-average-policy available-loci \
    --write-pi-tables \
    --comparand-list comparand.list \
    --file-prefix fst_genes_hudson. --threads 4 --log-file fst_genes_hudson.log

I am running into the following error:

Computing FST between 398 pairs of samples.
Streaming over 25351 total regions across 63 chromosomes.
Error: Cannot convert VcfRecord to Variant, as one of the VcfRecord REF or ALT sequences/alleles is not a single nucleotide (it is not a SNP), at chr1:18018. At the time bein
g, we are not supporting indels and other such variants.

The position mentioned is an indel and it is only present on set1.bcf

CHROM  POS     ID      REF     ALT     QUAL
chr1       18018   Repeat  TA      T       4259.94

I also tried doing the following:

use the --filter-region-bed flag and give a bed file that does not contain that position - Did not work
use the --filter-total-only-biallelic-snps flag - Did not work

I had a look in the previous version that I was using (v0.3.0) and there was an implementation to handle multiallelic positions. Has that been removed in the new version? In any case, it should skip non-snp position as default, right?

Any idea how I could solve this problem?

Cheers,

Vinicius

lczech commented 4 months ago

Hey Vinicius @vilperte,

indeed, grenedalf still supports multiallelic positions :-) The issue, as stated in the error message that you are seeing, and as you also pointed out, is that the position is an indel, and not a SNP (biallelic or multiallelic doesn't matter there). That is the part that causes the error. I just had a look of the code handling this in grenedalf v0.3.0, and it should give the same error for this, as indels are treated the same way in that version. I however did some minor refinements of the VCF handling in between the versions which might affect this, but I don't think so unless I missed something. So, if this worked before, that would be strange - could you maybe share your VCF with me in that case?

As for your approaches to solve this:

I also tried doing the following:

use the --filter-region-bed flag and give a bed file that does not contain that position - Did not work

use the --filter-total-only-biallelic-snps flag - Did not work

Hm yes, both of these are applied after the input file parsing, and the error is already caused when reading in the file. We basically assume files to be good first, and then filter later...

For that reason, I think it makes sense that grenedalf simply ignores indels and other non-SNP positions, instead of terminating with an error - thanks for bringing this to my attention! I'll fix that - let's keep this issue open for now. Also, for you to make progress in the meantime, you can simply filter the VCF beforehand to remove non-SNP positions (or set their FILTER field to some non-passing value, that should also work).

Also, if I may, I'm interested in your use case there:

If you are working with VCFs, I'd recommend to have a look at the caveat about using them with pool seq data, see here.

Furthermore, you say

My set of samples are coming from 3 different vcf files, so I am using the --sample-group-merge-table-file to create the different groups (it works pretty well).

I'm not sure that I understand - why are you merging the samples? It sounds a bit like each of your three input files is meant to represent one pool, and you are merging each file into one sample? In that case - where does your VCF come from, and what do the AD counts represent? For instance, if each of the columns in the VCF comes from one sequenced individual each, only to then be merged into an artificial pool, you'd basically not have any sampling bias from the pooling any more. So then using the number of individuals per VCF as the pool size for that VCF would over-correct, as you never had the sampling noise of "normal" pool-seq to begin with. Just want to make sure that what you are doing is actually meeting the statistical assumptions of the Pool-seq approach, as otherwise, the estimators would not make too much sense for your use case.

Happy to hear though that the merging is working well! Also, nice to see that the region window type is useful! I've added that because I thought it was useful, but wasn't quite sure if that is something that is actually needed - very cool!

Cheers and so long Lucas

PS: All the implemented statistics are only developed for SNPs, and not for indels - that would require some completely new statistical approach, and to the best of my knowledge, no one has done that yet for pool seq data. I think that would be an interesting area of research, and I'd be happy to implement this if anyone does the necessary stats first :-)

vilperte commented 4 months ago

Hi @lczech,

Thanks for the quick answer.

Unfortunately I cannot share the vcf files because of project partners involved in the study.

In the input section of the documentation it says:

--vcf-path TEXT:PATH(existing)=[] ... List of vcf/bcf files or directories to process. For directories, only files with the extension .vcf[.gz]|.bcf are processed. To input more than one file or directory, either separate them with spaces, or provide this option multiple times. This expects that the input file has the per-sample VCF FORMAT field AD (alleleic depth) given, containing the counts of the reference and alternative base. This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. We then interpret the AD field as the allele counts of each pool of individuals. Note that only SNP positions are used; positions that contain indels and other non-SNP variants are skipped.

Shouldn't the position then be skipped?

I am planning to filter the vcf for now, but it is quite a large dataset and I was trying to avoid too many tmp files.

And regarding the merging, my use case is more or less what you mentioned. I am merging single samples from a vcf and creating an artificial pool. I know that this is definitely not the correct way, but right now I am doing sort of a proof of concept for the approach. Later on the comparison will be done with real sequenced pools.

Cheers,

Vinicius

lczech commented 4 months ago

Haha oh my...

Note that only SNP positions are used; positions that contain indels and other non-SNP variants are skipped. Shouldn't the position then be skipped?

Well, yes. Seems I documented that from memory instead of looking at what's actually implemented. Sorry for that! I'll fix that as soon as I can - hopefully later today or tomorrow. So if it's not urgent, you can wait instead of filtering your VCF.

Cheers Lucas

vilperte commented 4 months ago

I will keep testing some of the parameters, but will also wait for the skipping implementation then ;)

Thanks again and have a nice day.

Cheers,

Vinicius

lczech commented 4 months ago

Hey @vilperte,

okay, I had a look at this, and the documentation was right, these positions should have been skipped all along, and I actually had checks for this in the code already. However, there was a weird special case that escaped these checks: insertions that also contained the AD field in the FORMAT part. I did not have that in my test data (I do now), and so that had gone unnoticed so far.

Fixed now on the dev branch - please try and report if that works now.

Thanks, so long Lucas

vilperte commented 4 months ago

Hi @lczech,

It works perfectly. Thanks once again for the quick implementation! ;)

Cheers,

Vinicius

lczech commented 4 months ago

Great, I made the fix available in v0.5.1 now. Thanks again for reporting this!

lczech / grenedalf

Multiallelic positions handling not available in v0.5.0 #21