harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data
https://harrispopgen.github.io/mutyper/
MIT License
7 stars 3 forks source link

Strandedness issue #14

Closed RenzoTale88 closed 3 years ago

RenzoTale88 commented 3 years ago

Hello, I'm trying to run mutyper on a series of custom VCF, and I'm coming across issues in specifying a bed file with positions for the negative strand exons. I got a file as follow:

Contig3 3727    4034
Contig3 4649    5588

Where the interval specify the positions of the negative stranded exons. However, when I pass it to mutyper, I get the following error:

Traceback (most recent call last):
  File "/path/to/mutyperenv/bin/mutyper", line 8, in <module>
    sys.exit(main())
  File "/path/to/mutyperenv/lib/python3.8/site-packages/mutyper/cli.py", line 308, in main
    args.func(args)
  File "/path/to/mutyperenv/lib/python3.8/site-packages/mutyper/cli.py", line 104, in variants
    anc_kmer, der_kmer = ancestor.mutation_type(variant.CHROM,
  File "/path/to/mutyperenv/lib/python3.8/site-packages/mutyper/ancestor.py", line 106, in mutation_type
    if not self._revcomp_func(chrom, pos):
  File "/path/to/mutyperenv/lib/python3.8/site-packages/mutyper/ancestor.py", line 61, in _reverse_strand
    if pos < self.strandedness[chrom][closest_idx][1]:
IndexError: list index out of range

Is there problem related to the bed file? Or a way to understand the offending position?

Thanks in advance Andrea

wsdewitt commented 3 years ago

Hi Andrea, thanks for noting this issue. I'm not sure what's causing this, so I'll have to run more tests on this feature. I'm afraid it'll be a few weeks before I'll have time to dig into debugging this, but if you'd like to send along minimal inputs that reproduce this error, I can debug based on that.

RenzoTale88 commented 3 years ago

@WSDeWitt sure no problem. I'll create some shared folder to send you as soon as I can. Thanks for the help!

Andrea

RenzoTale88 commented 3 years ago

@WSDeWitt sorry for never coming back with this. I've started looking into this since come across the issue one more on a separate dataset. I'm using the software installed in an anaconda environment with python v3.9. The problem arises when the position of a variant comes after the latest interval provided in the bed file. If I add the following code at line 107 of ancestor.py to check the array:

try:
            self._revcomp_func(chrom, pos)
        except:
            closest_idx = bisect.bisect(self.strandedness[chrom], (pos, -1))
            sys.stderr.write(f'Variant: {chrom} {pos}\n')
            sys.stderr.write(f'Index in array from bisect.bisect: {closest_idx}\n')
            sys.stderr.write(f"Length of the strandedness array: {len(self.strandedness[chrom])}\n")
            for i in range(-4, 0): sys.stderr.write(f"Index position {i}: {self.strandedness[chrom][i]}\n")

I can see the variant creating problem like this:

Variant: 18 65895671
Index in array from bisect.bisect: 15074
Length of the strandedness array: 15074
Index position -4: (65892711, 65892782)
Index position -3: (65893218, 65893402)
Index position -2: (65893751, 65894904)
Index position -1: (65895556, 65895639)

The problem using the index coming out of bisect.bisect as it is. The index from this command seems to be 1-based as pointed (here)[https://docs.python.org/3.9/library/bisect.html]. The problem is that it calls the wrong interval consistently for every variant in the dataset. For example, if I use a variant falling into the interval in position -2 (18, 65894668) I get the following output:

Variant: 18 65894667
Index in array from bisect.bisect: 15073
Length of the strandedness array: 15074
Closest index position 15073: (65895556, 65895639)
Real index position 15072: (65893751, 65894904)

The problem can be easily solved by changing the code as follow:

    def _reverse_strand(self, chrom: str, pos: int):
        r"""return True if strand_file indicates reverse complementation at
        this site"""
        closest_idx = bisect.bisect(self.strandedness[chrom], (pos, -1))
        if pos < self.strandedness[chrom][closest_idx - 1][1]:
            return True
        return False

or again

    def _reverse_strand(self, chrom: str, pos: int):
        """return True if strand_file indicates reverse complementation at
        this site"""
        closest_idx = bisect.bisect(self.strandedness[chrom], (pos, -1)) - 1
        if pos < self.strandedness[chrom][closest_idx][1]:
            return True
        return False

Hope it helps Andrea

wsdewitt commented 3 years ago

Hi Andrea, thank you for digging into this! I haven't had the time to investigate, but if you have a suggested fix (like the code above), I'd be happy to review a PR.

Ideally we'd also have a unit test for this in mutyper/tests. That's something I should have added before, and might have caught this before it became a problem for you.

RenzoTale88 commented 3 years ago

@WSDeWitt sure I'll do a pull request later today. It would be great if you can test it on your data first though (I'm unsure whether the change I'm suggesting is correct)

Andrea