phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0
25 stars 7 forks source link

Fix to issue 88 with new QC Check #90

Closed DarianHole closed 5 years ago

DarianHole commented 5 years ago

Added a new QC check to address issue #88

This was accomplished by creating a dictionary of subtypes found in the final subtype and checking that against the positive kmers found

DarianHole commented 5 years ago

After this address, we should probably do a new release so that Bioconda and Galaxy can be updated with the new changes and this bugfix.

DarianHole commented 5 years ago

Currently this check is having issues with mixed subtypes which I am looking for a way to fix at the moment.

DarianHole commented 5 years ago

As for the mixed subtype result, I have an update now for that but I needed to nest a for loop to do it.

subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1'
pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2'}

subtype = subtype.split(';')

d = {key: 'a' for (key) in subtype}
for i in d:
    a = i.split('.')
    primary_subtypes_set = set(a[0])
    new = a[0]
    for z in range(len(a)-1):
        new = new + '.' + a[z+1]
        primary_subtypes_set.add(new)

missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set
missing_nested_subtypes = (', '.join(missing_hier_subtypes))

print (missing_nested_subtypes)

>>> returns 2.2, 2.2.3

Again it works, it just has a nested for loop that I couldn't seem to get rid of. I'll push this on if it seems ok.

peterk87 commented 5 years ago

As for the mixed subtype result, I have an update now for that but I needed to nest a for loop to do it.

subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1'
pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2'}

subtype = subtype.split(';')

d = {key: 'a' for (key) in subtype}
for i in d:
    a = i.split('.')
    primary_subtypes_set = set(a[0])
    new = a[0]
    for z in range(len(a)-1):
        new = new + '.' + a[z+1]
        primary_subtypes_set.add(new)

missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set
missing_nested_subtypes = (', '.join(missing_hier_subtypes))

print (missing_nested_subtypes)

>>> returns 2.2, 2.2.3

Again it works, it just has a nested for loop that I couldn't seem to get rid of. I'll push this on if it seems ok.

This looks like a valid solution producing the expected results. I just have a couple comments and suggestions.

Where you have:

for z in range(len(a)-1):
    new = new + '.' + a[z+1]
    primary_subtypes_set.add(new)

You could use Python foreach:

new = ''
for subtype_value in a:
    new = new + '.' + subtype_value
    # only add subtypes that aren't in the positive subtype results
    if new not in pos_subtypes_set:
        primary_subtypes_set.add(new)

Or if you prefer using an index, you could do the same without mutating a string by appending the next subtype value:

for i in range(len(a)):
    sub_subtype = '.'.join(a[0:i+1])
    if sub_subtype not in pos_subtypes_set:
        primary_subtypes_set.add(sub_subtype)

You don't need to turn the subtype split into a dict since it should be an list/iterable.

subtypes = subtype.split(';')
for st in subtypes:
    st_vals = st.split('.')
    primary_subtypes_set = set()
    for i in range(len(st_vals)):
        sub_subtype = '.'.join(st_vals[0 : i+1])
        if sub_subtype not in pos_subtypes_set:
            primary_subtypes_set.add(sub_subtype)

For the above, the inner for loop could be refactored into its own function similar to what you had before.

def missing_nested_subtypes(subtype, pos_subtypes_set):
    primary_subtypes_set = set()
    for i in range(len(st_vals)):
        sub_subtype = '.'.join(st_vals[0 : i+1])
        if sub_subtype not in pos_subtypes_set:
            primary_subtypes_set.add(sub_subtype)
    return primary_subtypes_set

Also new isn't a very good variable name since in many languages it's the keyword for creating a new object instance. Also please try using more descriptive variable names (a -> subtype_values in above code) where it makes sense for readability. Also variables named i or j are typically used for storing integers while other letters like k and v might be used for the key and value of a dictionary, respectively. I usually use x in list comprehensions when I'm feeling lazy ([x for x in whatever if condition]).

Hope that helps!

DarianHole commented 5 years ago

This should be a bit better now. I'll add it to the code when I get in on Monday

subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1;1.111.11.1'
pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2', '1.111.11.1'}

def subtype_sets(subtype, pos_subtypes_set):
    for i in range(len(st_vals)):
        sub_subtype = '.'.join(st_vals[0 : i+1])
        if sub_subtype not in pos_subtypes_set:
            primary_subtypes_set.add(sub_subtype)
    return primary_subtypes_set

primary_subtypes_set = set()
subtype = subtype.split(';')
for i in subtype:
    st_vals = i.split('.')
    primary_subtypes_set = subtype_sets(st_vals, pos_subtypes_set)

missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set
missing_nested_subtypes = (', '.join(missing_hier_subtypes))

print (missing_nested_subtypes)

>>> returns 2.2, 1.111, 1.111.11, 1, 2.2.3

Without the print and the hard coded start.

DarianHole commented 5 years ago

I have code for either a nested loop (as committed), or two defined functions as I'll post below:

def missing_nested_subtypes(subtype, pos_subtype_set):
    primary_subtypes_set = set()

    def subtype_sets(subtype, pos_subtypes_set):
        for i in range(len(st_vals)):
            sub_subtype = '.'.join(st_vals[0 : i+1])
            if sub_subtype not in pos_subtypes_set:
                primary_subtypes_set.add(sub_subtype)
        return primary_subtypes_set

    subtype = subtype.split(';')
    for i in subtype:
        st_vals = i.split('.')
        primary_subtypes_set = subtype_sets(st_vals, pos_subtypes_set)

    missing_nested_subtypes = (', '.join(primary_subtypes_set))
    return missing_nested_subtypes

Let me know which you prefer

peterk87 commented 5 years ago

The solution in your comment is cleaner and should be easier to test. I just have a few comments:

Thanks!