Closed: ssnn-airr closed this issue 2 years ago.
Original comment by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
I attached an MRE. I hacked pRESTO so that `collectEEQueue` in `EstimateError.py` would output `dist` and `dist_df` just before line 367 (the offending line) during an actual run that was hitting the error. The error can then be reproduced locally as well:
```python
import pickle

import numpy as np
import pandas as pd

with open('debug_EE_dist.pkl', 'rb') as f:
    dist = pickle.load(f)
with open('debug_EE_dist_df.pkl', 'rb') as f:
    dist_df = pickle.load(f)

# line 367 -- the error happens here
thresh_df = pd.DataFrame.from_dict(
    {'thresh': {'ALL': dist_df.index[np.argmax(dist) +
        int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
                     if dist[index] == np.min(dist)]))]}})
```
:confused:
Original comment by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
Man, that `thresh_df` line is insane! I traced the error down to:

```python
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
             if dist[index] == np.min(dist)]))
```

Apparently, `[index for index in np.argsort(dist[:int(len(dist)*0.75)]) if dist[index] == np.min(dist)]` evaluates to `[]`. As a result, `int(np.mean([]))` complains: `ValueError: cannot convert float NaN to integer`.
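The failure can be reproduced in isolation, without the pickled data: the empty list makes `np.mean` return NaN, and the `int()` conversion is what actually raises.

```python
import numpy as np

# np.mean of an empty sequence returns nan (with a RuntimeWarning)...
m = np.mean([])
print(np.isnan(m))  # True

# ...and int() of nan is what raises the ValueError
try:
    int(m)
except ValueError as err:
    print(err)  # cannot convert float NaN to integer
```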
Looking at the list comprehension: `np.min(dist)` evaluates to `0.0`. The comprehension evaluates to an empty list because

```python
for index in np.argsort(dist[:int(len(dist)*0.75)]):
    print(dist[index])
```

prints

```
2204.0
2844.0
3120.0
3887.0
4270.0
4771.0
5883.0
9013.0
9371.0
9694.0
15772.0
16762.0
18100.0
20350.0
25545.0
25753.0
40180.0
41889.0
45282.0
45910.0
55368.0
56283.0
58865.0
63356.0
64246.0
64574.0
68575.0
69425.0
104577.0
134327.0
267810.0
378157.0
494295.0
893743.0
1436839.0
2773445.0
14354059.0
```

none of which equals `np.min(dist)`.
Original comment by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
Not sure if it matters: I got this error when:

- `-n 20`
- `-n 5`

I did not get an error when:

- `-n 20` (but with my data that seems like a useless estimation, because there was only 1 set with >20 sequences that “passed”)
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
Potentially relevant: a pandas change in v0.24 that added support for NaN in integer vectors:
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
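For context, that change introduced the nullable `Int64` extension dtype, which lets a missing value live in an integer column (a minimal illustration):

```python
import pandas as pd

# Capital-I 'Int64' is the nullable integer dtype added in pandas v0.24
s = pd.Series([1, 2, None], dtype='Int64')
print(s.isna().tolist())  # [False, False, True]
print(str(s.dtype))       # Int64
```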
(And, yeah, that `thresh_df` line has some readability issues…)

Also potentially relevant as a stop-gap solution: you can specify `-f` to `SplitSeq sample` to randomly sample evenly from sequences sharing the same value in the specified field. E.g., `-f UMI -n 100` will sample up to 100 sequences from each unique UMI. You will change the read distribution, but it’s a good way to compress huge UMI groups that dominate the read counts.
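The effect of that kind of per-group cap can be sketched in pandas (a toy illustration with made-up column names, not pRESTO's implementation):

```python
import pandas as pd

# Toy data: three UMI groups of very different sizes (hypothetical columns)
df = pd.DataFrame({
    'UMI': ['A'] * 500 + ['B'] * 3 + ['C'] * 120,
    'SEQ': range(623),
})

CAP = 100
# Randomly sample up to CAP reads per UMI group, keeping small groups whole
sampled = (df.groupby('UMI', group_keys=False)
             .apply(lambda g: g.sample(min(len(g), CAP), random_state=1)))

print(sampled.groupby('UMI').size().to_dict())  # {'A': 100, 'B': 3, 'C': 100}
```

The huge group A is compressed to the cap, while group B passes through untouched, which is what flattens the read distribution.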
Original comment by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
Thanks. I tried the stop-gap, but the same error occurred. The calculation itself (`-f INDEX_UID -n 100` ==> 3,444,378 total sets) took under 10 min (according to the progress bar). Still, cool trick about `SplitSeq sample` that I didn’t know about!

So is `int(np.mean([]))` supposed to be `0` in such a case? If so, `np.argmax(dist)` also evaluates to `0` here, so `dist_df.index[np.argmax(dist) + 0]` evaluates to `0.0`, which becomes the value of `THRESH: ALL`. Does that mean thresh is supposed to be 1 - 0.0 = 1 here? That would basically mean there’s no point clustering by sequence.

I’d debug more, but I don’t know what exactly the `thresh_df` line is trying to do, and the rest of EstimateError.py is quite complex too…
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
I’m not sure. I’ll have to actually look at it. EstimateError is a mix of very old code by me and some stuff Roy added later for the UMI collision methods. I’ll take a look a bit later… (the day job keeps me pretty busy these days).

It seems feasible that the mean distance is indeed zero and it’s just not handled properly. My gut says it’s some sort of numpy/pandas version issue if that’s the case, because this should’ve popped up before.
Original comment by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
Phew. I spent some time investigating this, as I (thought I) needed it to work for a current project. I don’t think it’s a numpy/pandas version issue. The core problem, if I’m not wrong, lies in the logic of that `thresh_df` line. Beyond that, I also identified multiple other errors in the EE-related code. It’s a bit too complicated to explain here, so I put the details in this shared Benchling page.
Also tagging @ruoyijiangyale since I think this is essentially his code?
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
Very nice. Yeah, @{557058:70d83662-347f-4227-9460-d059892b4588} is probably the one to fix this issue. IIRC, there’s an assumption of a binomial distribution for the pairwise distances (based on what you would get with random 4-character strings), and those threshold windows are derived from that.
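That assumption is easy to check empirically: the Hamming distance between two uniformly random 4-letter strings of length L is Binomial(L, 3/4), since independent positions mismatch with probability 3/4 (a toy simulation, not the EstimateError code):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 20, 10_000  # string length, number of random pairs

# Encode random 4-letter (e.g. A/C/G/T) strings as integers 0..3
a = rng.integers(0, 4, size=(N, L))
b = rng.integers(0, 4, size=(N, L))

# Pairwise Hamming distances; expected mean is L * 3/4 = 15
d = (a != b).sum(axis=1)
print(round(d.mean(), 1))
```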
Original comment by Roy Jiang (Bitbucket: ruoyijiangyale).
Hey Julian,
Welcome to this corner of pRESTO. It looks like you’ve caught a bug in the code.
```python
# This...
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
             if dist[index] == np.min(dist)]))

# ...should be:
int(np.mean([index for index in np.argsort(dist[:int(len(dist)*0.75)])
             if dist[index] == np.min(dist[:int(len(dist)*0.75)])]))

# I'm guessing that in the cases this was tested on,
#     np.min(dist) == np.min(dist[:int(len(dist)*0.75)])
# held, so it was missed.

# Probably should've written it as:
# consider dist up to 0.75 of the profile and find the index of the min value
dist_window = dist[:int(len(dist)*0.75)]
np.argmin(dist_window)
```
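A minimal array makes the failure mode concrete (the `dist` values here are made up, not the real distance profile): when the global minimum sits in the last 25% of `dist`, the original comprehension comes out empty, while the windowed version does not.

```python
import numpy as np

# Toy distance profile: the global minimum (0.0) lies outside the 75% window
dist = np.array([5.0, 3.0, 4.0, 0.0])
window = dist[:int(len(dist) * 0.75)]   # [5.0, 3.0, 4.0]

# Original: tests window values against the *global* minimum -> empty list
buggy = [i for i in np.argsort(window) if dist[i] == np.min(dist)]
print(buggy)  # []

# Fixed: tests against the minimum *within the window*
fixed = [int(i) for i in np.argsort(window) if dist[i] == np.min(window)]
print(fixed)  # [1]

# ...which, for a unique window minimum, is just np.argmin
print(int(np.argmin(window)))  # 1
```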
Regarding the bigger issues with `EstimateError.py set`…

The goal is to solve a 1D minimum-threshold identification problem with the lowest-tech method possible. `EstimateError.py set` is challenging for the reasons you describe: sequencing protocols have different error profiles, and this was optimized for our protocols based on a few 2D histogram calculations from a long time ago.

The reason we didn’t try harder is that 0.8 is the widest cluster that CD-HIT-EST can handle. So there was no point coming up with a sophisticated way to find the threshold when the open-source (no vsearch/usearch!) high-speed clustering technique had limitations anyway. I do know from simulations that this method works, though.

The original EstimateError.py was written before my time, for calculating Illumina sequencing error profiles. We then adapted it for this specific problem.
Original report by Julian Zhou (Bitbucket: [Julian Zhou](https://bitbucket.org/Julian Zhou)).
Error:
Console log: