Closed Salvobioinfo closed 2 years ago
Hi @Salvobioinfo -- sorry to hear that faToVcf isn't working as you expect. faToVcf uses the first sequence in the input fasta file as reference, unless you add the -ref=seqName option (where seqName is the name of the reference sequence, which must be included in the input fasta file). In order to help us diagnose, I have a few questions:
-verbose=2
, and then copy-paste both the command and the stderr messages from the terminal into a reply, that will provide additional clues.Hi @AngieHinrichs, 1.Question: I downloaded the executable following the instruction at this UCSC_link, the 13.11.2021. 2-3. Question: The command I used is: /dir/faToVcf -includeRef -ref=EPI_ISL_402124 -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 2021-11-09_masked.fa 21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 & The log file is: fatovcf.log
Thanks @Salvobioinfo -- it helps to know that you're using the GISAID MSA. That file uses a masking character that faToVcf does not expect: it uses '-' instead of 'N' to mask sites that are known to be problematic when building phylogenetic trees. When faToVcf sees a '-', it interprets that as an alignment gap not an N or masked base. When it sees '-' in the reference sequence, it assumes that some other sequence has an insertion at that location, and it skips over that base in all input sequences. However, if '-' is used as a masking character not an alignment gap, then faToVcf's assumption is incorrect and it results in incorrect coordinates in the output.
Here's how to work around that.
EPI_ISL_402124.fa
.EPI_ISL_402124.fa
to make sure that the sequence name after >
is not exactly "EPI_ISL_402124". (The long hCoV-19/.... name is fine, or only "WIV04" -- just not "EPI_ISL_402124".)EPI_ISL_402124.fa
with 2021-11-09_masked.fa
like this:
/dir/faToVcf -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 \
<(cat EPI_ISL_402124.fa 2021-11-09_masked.fa) \
21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 &
EPI_ISL_402124 will still be included in the VCF because it is in 2021-11-09_masked.fa
, but the directly downloaded, unmasked sequence of EPI_ISL_402124 will be used as the reference, resulting in a masking behavior instead of gap-skipping behavior.
Hope that helps, and please let us know if any of that is unclear or if it doesn't work.
Dear @AngieHinrichs. The tool works well now. Thank you. Salvo
I'm using FaToVcf to extract the SNPs from Multiple alignment format, but even if I give as input the name of correct ref sequence, in the vcf file the reference is completely wrong, it seems as the tool use as reference a consensus sequence.
Any solutions to solve this problem? I need to extract SNPs, but using the original reference otherwise it makes no sense.