harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data
https://harrispopgen.github.io/mutyper/
MIT License
7 stars 3 forks source link

Issue with directionality of the chain file and possible fasta indexing bug. #27

Closed mrvollger closed 3 years ago

mrvollger commented 3 years ago

Hi Will,

I am probably doing something very silly, but I am having trouble getting the ancestor command to work as I would expect.

Looking at the documentation it looks like the chain file should be from the outgroup to the reference (https://harrispopgen.github.io/mutyper/cmd.html#Positional%20Arguments) which in my case gives a chain file with a header that looks like (ref has chromosome names, outgroup has contig names):

chain/out-to-ref/chr10-h1tg000001l.chain:chain 73054 chr10 134758134 + 19131310 19212870 h1tg000001l 19803926 + 18604719 18681137 1 

and not for example the other direction:

chain/ref-to-out/chr10-h1tg000001l.chain:chain 73054 h1tg000001l 19803926 + 18604719 18681137 chr10 134758134 + 19131310 19212870 1

(my source for saying the directionality of these chain files http://genome.ucsc.edu/goldenPath/help/chain.html)

However when I run mutyper ancestor with the first chain all the SNP positions in my VCF files get turned into Ns in the ancestral fasta output leading to an empty vcf from variants. whereas if I run with the second file I get results, but it seems opposite what the documentation suggests if I am reading it right (and we thought the results looked weirdish).

Here is my calls using mutyper

rule make_ancestor:
    input:
        bcf=rules.subset_vcf.output.bcf,
        rgn=rules.subset_vcf.output.rgn,
        chain=rules.make_chain.output.chain,
        ref=rules.prep_ancestor.output.ref,
        out=rules.prep_ancestor.output.out,
    output:
        fasta=temp("temp/mutyper/ancestral_fasta/ref-to-out/{ref}-{out}.fa"),
        fai=temp("temp/mutyper/ancestral_fasta/ref-to-out/{ref}-{out}.fa.fai"),
    log:
        "logs/mutyper/ancestral_fasta/ref-to-out/{ref}-{out}.log",
    conda:
        "../envs/mutyper.yml"
    shell:
        """
        mutyper ancestor \
            --bed {input.rgn} \
            {input.bcf} \
            {input.ref} \
            {input.out} \
            {input.chain} \
        {output.fasta}
        samtools faidx {output.fasta}
        """

rule annotate_vcf:
    input:
        bcf=rules.subset_vcf.output.bcf,
        fasta=rules.make_ancestor.output.fasta,
    output:
        bcf=temp("temp/mutyper/vcf/ref-to-out/{ref}-{out}.bcf"),
        csi=temp("temp/mutyper/vcf/ref-to-out/{ref}-{out}.bcf.csi"),
    log:
        "logs/mutyper/annotate_vcf/ref-to-out/{ref}-{out}.log",
    conda:
        "../envs/mutyper.yml"
    shell:
        """
        mutyper variants {input.fasta} {input.bcf} \
            | bcftools sort -Ob -m 8G - \
            > {output.bcf}
        bcftools index -f {output.bcf}
        """

###############################################################################
###############################################################################
###############################################################################
rule make_ancestor_out_to_ref:
    input:
        bcf=rules.subset_vcf.output.bcf,
        rgn=rules.subset_vcf.output.rgn,
        chain=rules.make_chain.output.chain_out_to_ref,
        ref=rules.prep_ancestor.output.ref,
        out=rules.prep_ancestor.output.out,
    output:
        fasta=temp("temp/mutyper/ancestral_fasta/out-to-ref/{out}-{ref}.fa"),
        fai=temp("temp/mutyper/ancestral_fasta/out-to-ref/{out}-{ref}.fa.fai"),
    log:
        "logs/mutyper/ancestral_fasta/out-to-ref/{ref}-{out}.log",
    conda:
        "../envs/mutyper.yml"
    shell:
        """
        mutyper ancestor \
            --bed {input.rgn} \
            {input.bcf} \
            {input.ref} \
            {input.out} \
            {input.chain} \
        {output.fasta}
        samtools faidx {output.fasta}
        """

rule annotate_vcf_out_to_ref:
    input:
        bcf=rules.subset_vcf.output.bcf,
        fasta=rules.make_ancestor_out_to_ref.output.fasta,
    output:
        bcf=temp("temp/mutyper/vcf/out-to-ref/{out}-{ref}.bcf"),
        csi=temp("temp/mutyper/vcf/out-to-ref/{out}-{ref}.bcf.csi"),
    log:
        "logs/mutyper/annotate_vcf/out-to-ref/{ref}-{out}.log",
    conda:
        "../envs/mutyper.yml"
    shell:
        """
        mutyper variants {input.fasta} {input.bcf} \
            | bcftools sort -Ob -m 8G - \
            > {output.bcf}
        bcftools index -f {output.bcf}
        """

and I have deposited an example of all the data I am working with here: https://eichlerlab.gs.washington.edu/help/mvollger/share/mutyper/small_example/

Dir structure to help navigation:

 mutyper
├──  all.bcf
├──  all.bcf.csi
├──  ancestral_fasta
│  ├──  out-to-ref
│  │  ├──  h1tg000001l-chr10.fa
│  │  └──  h1tg000001l-chr10.fa.fai
│  └──  ref-to-out
│     ├──  chr10-h1tg000001l.fa
│     └──  chr10-h1tg000001l.fa.fai
├──  chain
│  ├──  out-to-ref
│  │  └──  h1tg000001l-chr10.chain
│  └──  ref-to-out
│     └──  chr10-h1tg000001l.chain
├──  input_fasta
│  ├──  out_h1tg000001l-chr10.fa
│  ├──  out_h1tg000001l-chr10.fa.fai
│  ├──  ref_chr10-h1tg000001l.fa
│  └──  ref_chr10-h1tg000001l.fa.fai
├──  psl
│  └──  clint.psl
├──  subset
│  ├──  rgn
│  │  └──  chr10-h1tg000001l.rgn
│  └──  vcf
│     └──  chr10-h1tg000001l.bcf
└──  vcf
   ├──  out-to-ref
   │  ├──  h1tg000001l-chr10.bcf
   │  └──  h1tg000001l-chr10.bcf.csi
   └──  ref-to-out
      ├──  chr10-h1tg000001l.bcf
      └──  chr10-h1tg000001l.bcf.csi

Also a perhaps related issue that I often get errors from ancestor when it tries to rebuild an fai even thought it has already been made in a pervious setup step. This becomes an issues when multiple ancestor commands run at the same time I think.

Traceback (most recent call last):                                                                          
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/shutil.py", line 806, in move                                                      
    os.rename(src, real_dst)                                                                                
FileNotFoundError: [Errno 2] No such file or directory: 'temp/mutyper/input_fasta/out_h1tg000001l-chr10.fa.fa
i.tmp' -> 'temp/mutyper/input_fasta/out_h1tg000001l-chr10.fa.fai'                                           

During handling of the above exception, another exception occurred:                                         

Traceback (most recent call last):                                                                          
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/pyfaidx/__init__.py", line 600, in build_index                       
    shutil.move(self.indexname + '.tmp', self.indexname)                                                    
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/shutil.py", line 826, in move                                                      
    copy_function(src, real_dst)                                                                            
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/shutil.py", line 435, in copy2                                                     
    copyfile(src, dst, follow_symlinks=follow_symlinks)                                                     
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/shutil.py", line 264, in copyfile                                                  
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:                                                  
FileNotFoundError: [Errno 2] No such file or directory: 'temp/mutyper/input_fasta/out_h1tg000001l-chr10.fa.fa
i.tmp'                                                                                                      

During handling of the above exception, another exception occurred:                                         

Traceback (most recent call last):                                                                          
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/bin/mutyper", line 8, in <module>                                                                
    sys.exit(main())                                                                                        
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/mutyper/cli.py", line 336, in main                                   
    args.func(args)                                                                                         
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/mutyper/cli.py", line 39, in ancestral_fasta                         
    out = pyfaidx.Fasta(args.outgroup, read_ahead=10000)                                                    
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/pyfaidx/__init__.py", line 998, in __init__                          
    self.faidx = Faidx(                                                                                     
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/pyfaidx/__init__.py", line 435, in __init__                          
    self.build_index()                                                                                      
  File "/net/eichler/vol26/projects/chm13_t2t/nobackups/sd-divergence/.snakemake/conda/9d645df7917cec5d8d94c2
745d1f65b1/lib/python3.9/site-packages/pyfaidx/__init__.py", line 603, in build_index                       
    raise IOError(                                                                                          
OSError: temp/mutyper/input_fasta/out_h1tg000001l-chr10.fa.fai may not be writable. Please use Fasta(rebuild=
False), Faidx(rebuild=False) or faidx --no-rebuild.

Thanks in advance, Mitchell

mrvollger commented 3 years ago

My troubles with chain files were due to my own (hopefully relatable) misunderstandings which I will explain in case they help someone else.

From the UCSC site a chain file header line is defined with these names

chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id 

I assumed that if the target (t) was hg18 for example and that the query (q) was hg17 that this would be called anhg17-to-hg18.chain chain because that is the common way to describe an alignment… but it seems the naming convention is opposite for liftOver and alignment files so this situation would actually be named hg18-to-hg17.chain. So just some confusion about UCSC naming from me.

Still looking into the indexing error I had.

mrvollger commented 3 years ago

I think I fixed the pyfaidx error in https://github.com/harrispopgen/mutyper/pull/28, but its a little unclear to me since it only crops up at random and I cannot get a case the reproduces it every time.

wsdewitt commented 3 years ago

Let's leave open until the fix is merged