ucscGenomeBrowser / kent

UCSC Genome Browser source tree. Stable branch: "beta".
http://genome.ucsc.edu/
Other
219 stars 89 forks source link

FaToVcf reference issue. #62

Closed Salvobioinfo closed 2 years ago

Salvobioinfo commented 2 years ago

I'm using FaToVcf to extract the SNPs from Multiple alignment format, but even if I give as input the name of correct ref sequence, in the vcf file the reference is completely wrong, it seems as the tool use as reference a consensus sequence.

Any solutions to solve this problem? I need to extract SNPs, but using the original reference otherwise it makes no sense.

AngieHinrichs commented 2 years ago

Hi @Salvobioinfo -- sorry to hear that faToVcf isn't working as you expect. faToVcf uses the first sequence in the input fasta file as reference, unless you add the -ref=seqName option (where seqName is the name of the reference sequence, which must be included in the input fasta file). In order to help us diagnose, I have a few questions:

  1. How did you install faToVcf? bioconda? Download executable? Build from source? When did you last install or update it?
  2. What is the exact faToVcf command that you ran? (It will help to see the exact options used)
  3. Better yet, if you can run the same faToVcf command but add -verbose=2, and then copy-paste both the command and the stderr messages from the terminal into a reply, that will provide additional clues.
  4. Best of all, if you can share the fasta input then I can try to run the same command -- but I understand that it's not always possible to share.
Salvobioinfo commented 2 years ago

Hi @AngieHinrichs, 1.Question: I downloaded the executable following the instruction at this UCSC_link, the 13.11.2021. 2-3. Question: The command I used is: /dir/faToVcf -includeRef -ref=EPI_ISL_402124 -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 2021-11-09_masked.fa 21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 & The log file is: fatovcf.log

  1. Question: I cannot share the file, but it’s possible to get it from GISAID. It’s the MSA (masked). Thank you in advance. Best, Salvo
AngieHinrichs commented 2 years ago

Thanks @Salvobioinfo -- it helps to know that you're using the GISAID MSA. That file uses a masking character that faToVcf does not expect: it uses '-' instead of 'N' to mask sites that are known to be problematic when building phylogenetic trees. When faToVcf sees a '-', it interprets that as an alignment gap not an N or masked base. When it sees '-' in the reference sequence, it assumes that some other sequence has an insertion at that location, and it skips over that base in all input sequences. However, if '-' is used as a masking character not an alignment gap, then faToVcf's assumption is incorrect and it results in incorrect coordinates in the output.

Here's how to work around that.

  1. Make a fasta file with only unmasked EPI_ISL_402124 sequence (not extracted from the MSA file -- the original sequence from GISAID), say EPI_ISL_402124.fa.
  2. Edit EPI_ISL_402124.fa to make sure that the sequence name after > is not exactly "EPI_ISL_402124". (The long hCoV-19/.... name is fine, or only "WIV04" -- just not "EPI_ISL_402124".)
  3. Don't use -includeRef and -ref; instead, concatenate EPI_ISL_402124.fa with 2021-11-09_masked.fa like this:
    /dir/faToVcf -vcfChrom=NC_045512.2 -noGenotypes -ambiguousToN -verbose=2 \
    <(cat EPI_ISL_402124.fa 2021-11-09_masked.fa) \
    21-11-09_masked_EPI_ISL_402124.vcf > fatovcf_NC_045512.2.log 2>&1 &

    EPI_ISL_402124 will still be included in the VCF because it is in 2021-11-09_masked.fa, but the directly downloaded, unmasked sequence of EPI_ISL_402124 will be used as the reference, resulting in a masking behavior instead of gap-skipping behavior.

Hope that helps, and please let us know if any of that is unclear or if it doesn't work.

Salvobioinfo commented 2 years ago

Dear @AngieHinrichs. The tool works well now. Thank you. Salvo