merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
441 stars 146 forks source link

get_non_outliers returns all outliers for uniform distribution #629

Closed ShaiberAlon closed 7 years ago

ShaiberAlon commented 7 years ago

@meren, if I understand this correctly then median_absolute_deviation will only be zero for a uniform distribution, but why did you choose to define the entire vector as all outliers? Were you maybe thinking of the case of all zeros? If so, then I suggest to change this line: https://github.com/merenlab/anvio/blob/76c8f51a17c96a9530df67da384a24da032474a6/anvio/sequence.py#L197

To:

     if not median_absolute_deviation
         if values[0] == 0:
             # A vector of all zeros is considered "all outliers"
             return numpy.array([True] * values.size)
         else:
             # A vector of uniform non-zero values is "all non-outliers"
             # This could be important for silly cases (like in megahit) in which there is a maximum value for coverage
             return numpy.array([False] * values.size)
ShaiberAlon commented 7 years ago

(Also notice that I suggest to change this: https://github.com/merenlab/anvio/blob/76c8f51a17c96a9530df67da384a24da032474a6/anvio/sequence.py#L198 to:

return numpy.array([True] * values.size)

So that the type of the object returned from this function is consistent (and always a numpy array and never a list)

meren commented 7 years ago

This was a mistake on my part, and these changes look good to me, Alon. Thank you very much for catching it and providing a resolution.