samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
640 stars 241 forks source link

bcftools consensus with deletions starting outside region of interest #2091

Closed nage0178 closed 5 months ago

nage0178 commented 5 months ago

Is there a way for the consensus command to take into account deletions that start outside of the region of interest, analogous to the -r option for other commands? For example, if the sample from individual_ID has a deletion starting at 5219000 and ending at 5221000, the command below will not remove the deletion. I do not want the deletion in the consensus sequence.

samtools faidx reference.fa chr11:5220000-5227000 > reference_region.fa
cat reference_region.fa |bcftools consensus -H 1pIU -s individual_ID --mark-del - --mark-ins lc example.vcf.gz
pd3 commented 5 months ago

There is now a new option --regions-overlap which works similarly to other commands: overlapping variants starting outside of the target region of the fasta file can be ignored with --regions-overlap 0 or taken into account when set to 1 or 2.

Note for the latter there will be ambiguous cases, see https://github.com/samtools/htslib/issues/1746.

Please test this out and let me know if you encounter anything odd.

nage0178 commented 5 months ago

Thank you for your response. The --regions-overlap does not appear to work with the consensus command. The command

cat reference_region.fa|bcftools consensus -H 1pIU -s individual_ID --mark-del - --mark-ins lc --regions-overlap 1 example.vcf.gz

gives the following output

consensus: unrecognized option '--regions-overlap'

About: Create consensus sequence by applying VCF variants to a reference fasta
       file. By default, the program will apply all ALT variants. Using the
       --samples (and, optionally, --haplotype) option will apply genotype
       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth
       information, such as INFO/AD or FORMAT/AD.
Usage: bcftools consensus [OPTIONS] <file.vcf.gz>
Options:
    -c, --chain FILE               Write a chain file for liftover
    -a, --absent CHAR              Replace positions absent from VCF with CHAR
    -e, --exclude EXPR             Exclude sites for which the expression is true (see man page for details)
    -f, --fasta-ref FILE           Reference sequence in fasta format
    -H, --haplotype WHICH          Choose which allele to use from the FORMAT/GT field, note
                                   the codes are case-insensitive:
                                       N: N={1,2,3,..} is the index of the allele from GT, regardless of phasing (e.g. "2")
                                       R: REF allele in het genotypes
                                       A: ALT allele
                                       I: IUPAC code for all genotypes
                                       LR,LA: longer allele and REF/ALT if equal length
                                       SR,SA: shorter allele and REF/ALT if equal length
                                       NpIu: index of the allele for phased and IUPAC code for unphased GTs (e.g. "2pIu")
    -i, --include EXPR             Select sites for which the expression is true (see man page for details)
    -I, --iupac-codes              Output IUPAC codes based on FORMAT/GT, use -s/-S to subset samples
        --mark-del CHAR            Instead of removing sequence, insert character CHAR for deletions
        --mark-ins uc|lc|CHAR      Highlight insertions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
        --mark-snv uc|lc|CHAR      Highlight substitutions in uppercase (uc), lowercase (lc), or use CHAR, leaving the rest as is
    -m, --mask FILE                Replace regions according to the next --mask-with option. The default is --mask-with N
        --mask-with CHAR|uc|lc     Replace with CHAR (skips overlapping variants); change to uppercase (uc) or lowercase (lc)
    -M, --missing CHAR             Output CHAR instead of skipping a missing genotype "./."
    -o, --output FILE              Write output to a file [standard output]
    -p, --prefix STRING            Prefix to add to output sequence names
    -s, --samples LIST             Comma-separated list of samples to include, "-" to ignore samples and use REF,ALT
    -S, --samples-file FILE        File of samples to include
Examples:
   # Get the consensus for one region. The fasta header lines are then expected
   # in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

   # See also http://samtools.github.io/bcftools/howtos/consensus-sequence.html

I confirmed I have the latest version of bcftools and the --regions-overlap options works with the view command.

pd3 commented 5 months ago

This feature was added few hours ago by https://github.com/samtools/bcftools/commit/d25e3f1d623091dd4422d68c67a50a1dfa03163a. You'll need to update to the latest github version http://samtools.github.io/bcftools/howtos/install.html