sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Recommendations for handling large datasets #83

Open leeanapeters opened 1 year ago

leeanapeters commented 1 year ago

Hi, thank you for creating this great tool!

I was wondering if you could offer some guidance on handling large datasets in the unsupervised workflow? In particular, the clustering and KNN classification steps seem to be prohibitively memory-expensive.

I think that downsampling is interfering with the classification accuracy, so I would like to use all the data if possible.

Thanks so much for your help!

Leeana

sophiachen1 commented 11 months ago

Hi, I am also using this tool with large datasets (~150k sequences). The KNN classification produces an empty knn_seq.pkl and the error below. Have you ever encountered this error? I suspect it may be an out-of-memory issue in the KNN step.


ValueError                                Traceback (most recent call last)
/tmp/ipykernel_15992/968723552.py in <module>
----> 1 DTCRU.KNN_Sequence_Classifier(metrics=['AUC'],plot_metrics=True,n_jobs=-1, Load_Prev_Data=True,by_class=True)

~/deeptcr/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in KNN_Sequence_Classifier(self, folds, k_values, rep, plot_metrics, by_class, plot_type, metrics, n_jobs, Load_Prev_Data)
   2429         if plot_metrics is True:
   2430             if by_class is True:
-> 2431                 sns.catplot(data=df_out, x='Metric', y='Value', hue='Classes', kind=plot_type)
   2432             else:
   2433                 sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)

~/deeptcr/lib/python3.7/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
     44             )
     45         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46         return f(**kwargs)
     47     return inner_f
     48

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, seed, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
   3801     # so we need to define palette to get default behavior for the
   3802     # categorical functions
-> 3803     p.establish_colors(color, palette, 1)
   3804     if kind != "point" or hue is not None:
   3805         palette = p.colors

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in establish_colors(self, color, palette, saturation)
    317         # Determine the gray color to use for the lines framing the plot
    318         light_vals = [colorsys.rgb_to_hls(*c)[1] for c in rgb_colors]
--> 319         lum = min(light_vals) * .6
    320         gray = mpl.colors.rgb2hex((lum, lum, lum))
    321

ValueError: min() arg is an empty sequence
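
For anyone reading the traceback: the exception is raised in the plotting step, not in the KNN computation itself. The last frame takes `min()` over a list built from the palette colors, and that list is empty, which would be consistent with `df_out` having no rows to plot (this is an inference from the trace, not something confirmed in the thread). The sketch below reproduces just that failing step using only the standard library. The function names `frame_gray_lightness` and `safe_frame_gray_lightness` are hypothetical and are not part of seaborn or DeepTCR.

```python
import colorsys


def frame_gray_lightness(rgb_colors):
    """Mimic the failing step from the traceback: take the lightness of each
    RGB color, then scale the smallest one by 0.6."""
    light_vals = [colorsys.rgb_to_hls(*c)[1] for c in rgb_colors]
    # min() raises ValueError when light_vals is empty
    return min(light_vals) * 0.6


def safe_frame_gray_lightness(rgb_colors):
    """Hypothetical guard: return None instead of raising when there are
    no colors, i.e. nothing to plot."""
    if not rgb_colors:
        return None
    return frame_gray_lightness(rgb_colors)


if __name__ == "__main__":
    print(frame_gray_lightness([(0.2, 0.4, 0.6)]))
    print(safe_frame_gray_lightness([]))
    try:
        frame_gray_lightness([])
    except ValueError as err:
        print("ValueError:", err)
```

If this reading is right, checking whether the results table is empty before calling the classifier with `plot_metrics=True` would separate "the KNN step produced no results" from "the plot failed", which are different problems to debug.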