rki-mf1 / covsonar

A database-driven system for handling genomic sequences of SARS-CoV-2 and screening genomic profiles.
GNU General Public License v3.0
6 stars 0 forks source link

Fix match command #114

Closed silenus092 closed 4 months ago

silenus092 commented 1 year ago
  1. Fix deletion query

Change from

conditions.append("variant.start <= ?")
conditions.append("variant.end >= ?")

To

conditions.append("variant.start >= ?")
conditions.append("variant.end <= ?")
  1. Correct the combination of boolean ops used in the mutation query. --profile S:del:69:1 ORF8:D63N
    HAVING (mutation_1 >= 1
           OR (mutation_1 >= 1
               AND mutation_2 >= 1)))

    But it should be something like this.

    HAVING (mutation_1 >= 1
               AND mutation_2 >= 1)))
matthuska commented 1 year ago

I was thinking about the deletion conditions and I think they are correct in the original code. The idea is to accept any deletion that exactly matches the query, or where the query is contained within a deletion. So something like this where ^ indicates positions in the match query and the first line is the sequence itself:

ACGT----ACGT
    ^^^^   = a perfect match covering the whole deletion
     ^^  = also a match, as the query is within the deletion

Does that make sense?

matthuska commented 1 year ago

I should also mention that I'm pretty sure this is an intended new behavior in covsonar 2, so we should expect differences between covsonar 1 and 2 with respect to deletion queries if we use the default deletion match notation (e.g. S:del:69-70). If we want the same behavior as covsonar 1 we can use the exact deletion notation (S:del:=69-=70).

matthuska commented 1 year ago

@silenus092 can you remove the part 1 changes and keep the part 2 changes? Can you also add a new test case to ensure part 2 is working properly?

silenus092 commented 1 year ago

I was thinking about the deletion conditions and I think they are correct in the original code. The idea is to accept any deletion that exactly matches the query, or where the query is contained within a deletion. So something like this where ^ indicates positions in the match query and the first line is the sequence itself:

ACGT----ACGT
    ^^^^   = a perfect match covering the whole deletion
     ^^  = also a match, as the query is within the deletion

Does that make sense?

Hi Matt @matthuska , I got the idea for matching the subset within the deletion range.

However, I am still not convinced that the original code was correct because

for example; we have 5 mutations at ORF1ab genes; del:81-86 del:82-84 del:82-85 del:83-86 del:84-85

--> If we query del:82-84 Original code variant.start <= 81 AND variant.end >= 84 (note: we minus one at the start position)

This will also give us a del:82-85 mutation (hit by variant.end >= 84), but we want deletion in 82-84 only.

New code variant.start >= 81 AND variant.end <= 84

--> If we query del:81-86 Original code variant.start <= 80 AND variant.end >= 86 (note: we minus one at the start position)

This will give us only the del:81-86 mutation (because of '=80' and '=86' ), but it will not include del:82-84,del:82-85, del:83-86 and del:84-85 because 81 82 83 84 positions are all greater than 80

New code variant.start >= 80 AND variant.end <= 86

Not just the amino acid query, but it will give us the wrong result at the NT query level. --> If we query del:9624-9629 Original code variant.start <= 9623 AND variant.end >= 9629 (note: we minus one at the start position) This will give us only the del:9624-9629 mutation but not include del:9624-9628 (9628 is less than 9629, but at the generated query, we look for the end position that is greater than 2629)

New code variant.start >= 9623 AND variant.end <= 9629

Let me know if I am still mistaken or not understand something correctly.

matthuska commented 1 year ago

The way the code is implemented now, the query must be identical to or a subset of the actual deletion. I think that matches what is explained in the documentation:

Deletion notations are greedy by default. For example, the notation del:12 corresponds to any deletion that affects reference position 12. To fix a position, an equal sign (=) can be prepended. For instance, the notation del:=12 matches only 1bp deletions at reference position 12. Likewise, the notation del:1500-=1505 matches deletions that span at least reference positions 1500 to 1505 and definitely terminate at position 1505.

If the match has to span all of those positions, then we are not just looking for overlaps between the query and sequence. We are only allowing exact matches as well as queries that are a subset (= the match region is smaller than the actual deletion, but is completely contained within it).

I don't know if that is the intended behavior, but the documentation matches the code (I think).

@stephan-fuchs what do you think? Do we want to allow all overlaps? Or just a subset?

matthuska commented 1 year ago

Don't forget to enable our pre-commit hooks so you don't miss these types of formatting problems in the future: https://github.com/rki-mf1/covsonar/blob/dev/covsonar2-alpha/CONTRIBUTING.md#tldr-i-want-to-start-hacking-now

$ pre-commit install