Closed DarianHole closed 5 years ago
After this address, we should probably do a new release so that Bioconda and Galaxy can be updated with the new changes and this bugfix.
Currently this check is having issues with mixed subtypes which I am looking for a way to fix at the moment.
As for the mixed subtype result, I have an update now for that but I needed to nest a for loop to do it.
subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1'
pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2'}
subtype = subtype.split(';')
d = {key: 'a' for (key) in subtype}
for i in d:
a = i.split('.')
primary_subtypes_set = set(a[0])
new = a[0]
for z in range(len(a)-1):
new = new + '.' + a[z+1]
primary_subtypes_set.add(new)
missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set
missing_nested_subtypes = (', '.join(missing_hier_subtypes))
print (missing_nested_subtypes)
>>> returns 2.2, 2.2.3
Again it works, it just has a nested for loop that I couldn't seem to get rid of. I'll push this on if it seems ok.
As for the mixed subtype result, I have an update now for that but I needed to nest a for loop to do it.
subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1' pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2'} subtype = subtype.split(';') d = {key: 'a' for (key) in subtype} for i in d: a = i.split('.') primary_subtypes_set = set(a[0]) new = a[0] for z in range(len(a)-1): new = new + '.' + a[z+1] primary_subtypes_set.add(new) missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set missing_nested_subtypes = (', '.join(missing_hier_subtypes)) print (missing_nested_subtypes) >>> returns 2.2, 2.2.3
Again it works, it just has a nested for loop that I couldn't seem to get rid of. I'll push this on if it seems ok.
This looks like a valid solution producing the expected results. I just have a couple comments and suggestions.
Where you have:
for z in range(len(a)-1):
new = new + '.' + a[z+1]
primary_subtypes_set.add(new)
You could use Python foreach:
new = ''
for subtype_value in a:
new = new + '.' + subtype_value
# only add subtypes that aren't in the positive subtype results
if new not in pos_subtypes_set:
primary_subtypes_set.add(new)
Or if you prefer using an index, you could do the same without mutating a string by appending the next subtype value:
for i in range(len(a)):
sub_subtype = '.'.join(a[0:i+1])
if sub_subtype not in pos_subtypes_set:
primary_subtypes_set.add(sub_subtype)
You don't need to turn the subtype split into a dict since it should be an list/iterable.
subtypes = subtype.split(';')
for st in subtypes:
st_vals = st.split('.')
primary_subtypes_set = set()
for i in range(len(st_vals)):
sub_subtype = '.'.join(st_vals[0 : i+1])
if sub_subtype not in pos_subtypes_set:
primary_subtypes_set.add(sub_subtype)
For the above, the inner for loop could be refactored into its own function similar to what you had before.
def missing_nested_subtypes(subtype, pos_subtypes_set):
primary_subtypes_set = set()
for i in range(len(st_vals)):
sub_subtype = '.'.join(st_vals[0 : i+1])
if sub_subtype not in pos_subtypes_set:
primary_subtypes_set.add(sub_subtype)
return primary_subtypes_set
Also new
isn't a very good variable name since in many languages it's the keyword for creating a new object instance. Also please try using more descriptive variable names (a
-> subtype_values
in above code) where it makes sense for readability. Also variables named i
or j
are typically used for storing integers while other letters like k
and v
might be used for the key and value of a dictionary, respectively. I usually use x
in list comprehensions when I'm feeling lazy ([x for x in whatever if condition]
).
Hope that helps!
This should be a bit better now. I'll add it to the code when I get in on Monday
subtype = '2.2.3.1.2;2.2.3.3;2.2.3.1;1.111.11.1'
pos_subtypes_set = {'2', '2.2.3.3', '2.2.3.1', '2.2.3.1.2', '1.111.11.1'}
def subtype_sets(subtype, pos_subtypes_set):
for i in range(len(st_vals)):
sub_subtype = '.'.join(st_vals[0 : i+1])
if sub_subtype not in pos_subtypes_set:
primary_subtypes_set.add(sub_subtype)
return primary_subtypes_set
primary_subtypes_set = set()
subtype = subtype.split(';')
for i in subtype:
st_vals = i.split('.')
primary_subtypes_set = subtype_sets(st_vals, pos_subtypes_set)
missing_hier_subtypes = primary_subtypes_set - pos_subtypes_set
missing_nested_subtypes = (', '.join(missing_hier_subtypes))
print (missing_nested_subtypes)
>>> returns 2.2, 1.111, 1.111.11, 1, 2.2.3
Without the print and the hard coded start.
I have code for either a nested loop (as committed), or two defined functions as I'll post below:
def missing_nested_subtypes(subtype, pos_subtype_set):
primary_subtypes_set = set()
def subtype_sets(subtype, pos_subtypes_set):
for i in range(len(st_vals)):
sub_subtype = '.'.join(st_vals[0 : i+1])
if sub_subtype not in pos_subtypes_set:
primary_subtypes_set.add(sub_subtype)
return primary_subtypes_set
subtype = subtype.split(';')
for i in subtype:
st_vals = i.split('.')
primary_subtypes_set = subtype_sets(st_vals, pos_subtypes_set)
missing_nested_subtypes = (', '.join(primary_subtypes_set))
return missing_nested_subtypes
Let me know which you prefer
The solution in your comment is cleaner and should be easier to test. I just have a few comments:
missing_nested_subtypes
; also you can just return the joined string);
to be consistent with the subtype value.(', '.join(primary_subtypes_set))
aren't necessary and if you or someone adds a comma, a tuple could be returned instead of a string. Thanks!
Added a new QC check to address issue #88
This was accomplished by creating a dictionary of subtypes found in the final subtype and checking that against the positive kmers found