sidhomj / DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data
https://sidhomj.github.io/DeepTCR/
MIT License

Clustering error #20

Closed choosehappy closed 4 years ago

choosehappy commented 4 years ago

Getting an error when using our data, after loading with:

DTCRU.Load_Data(beta_sequences=beta,v_beta=v_beta,j_beta=j_beta,class_labels=class_labels,
                sample_labels=sample_labels, counts=counts)

Training appears to have gone ok :

DTCRU.Train_VAE(Load_Prev_Data=False,accuracy_min=0.85)

image

But clustering appears to fail: image

same error when using the phenograph method, so it isn't specific to the clustering approach. It also happens when randomly sampling:

image

Is it possible that the clustering methods produce some outliers, causing "sel" to not be an integer? or perhaps there is some metadata i need to set?

Other functions appear to work okay:

image

i see #1, which has a similar error, but my data exists as a single csv file which i'm loading via pandas and slicing the necessary columns out of. as such, loading via directory doesn't appear to be an option

any ideas?

sidhomj commented 4 years ago

will try to look at this tonight. I don't think I've fully tested some of the unsupervised methods with data loaded through the Load_Data function. Will hopefully have a fix soon.

choosehappy commented 4 years ago

great, thanks! happy to help debug if useful

sidhomj commented 4 years ago

thanks, I definitely appreciate the help, especially since the repository is still under active development and likely has bugs that need ironing out.

choosehappy commented 4 years ago

not a problem, i certainly know how it is : )

On Thu, Nov 14, 2019 at 10:14 PM John-William Sidhom < notifications@github.com> wrote:

thanks, I definitely appreciate the help especially since the repository is under active development still and likely still has bugs that need ironing out.


sidhomj commented 4 years ago

I think I just pushed up a fix. Let me know if it works on your data!

choosehappy commented 4 years ago

Unfortunately it didn't appear to fix it, same error

is there perhaps some other debugging information I can provide?

image

sidhomj commented 4 years ago

I would check to make sure that your sample labels and class labels are string types and not integers or floats.
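A quick way to run this check, as a minimal sketch only: the helper name `as_str_labels` and the toy label values below are assumptions for illustration and are not part of DeepTCR.

```python
import numpy as np

def as_str_labels(values):
    """Return the labels as a numpy array of strings (hypothetical helper)."""
    return np.asarray(values).astype(str)

# Toy integer IDs standing in for real sample labels.
raw_sample_labels = [1, 2, 2, 3]

# dtype.kind is 'i' for integers, 'f' for floats, 'U' for unicode strings.
kind_before = np.asarray(raw_sample_labels).dtype.kind

sample_labels = as_str_labels(raw_sample_labels)
kind_after = sample_labels.dtype.kind

print(kind_before, kind_after)
```

If `kind_before` is anything other than `'U'`, casting as above gives string labels before they are passed on.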

choosehappy commented 4 years ago

hmmm...they look good to me, appear to be all strings:

image

sidhomj commented 4 years ago

can you send me a sample of the csv file you're trying to run so I can see if I can replicate the error? jsidhom1@jhmi.edu

choosehappy commented 4 years ago

1 step ahead of you : ) already getting in contact with the data-owner to ensure no confidentiality issues. i suspect we'll be ok. will send afterward, likely in the next few hours

choosehappy commented 4 years ago

all good. just sent it

sidhomj commented 4 years ago

figured it out. the problem is you're passing lists to Load_Data, whereas the docs say you need to pass numpy arrays. if you put an np.array() around your inputs, the code should work.
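The fix described above might look like the following sketch. The keyword names are the ones used in the Load_Data call earlier in this thread; the toy values are made up for illustration, and the DeepTCR call itself is left commented out so the snippet runs without the library installed.

```python
import numpy as np

# Toy values standing in for columns taken out of a CSV (assumed, not real data).
raw_inputs = {
    "beta_sequences": ["CASSLGQAYEQYF", "CASSPGTGGYEQYF", "CASSIRSSYEQYF"],
    "v_beta": ["TRBV7-9", "TRBV5-1", "TRBV19"],
    "j_beta": ["TRBJ2-7", "TRBJ2-7", "TRBJ2-7"],
    "class_labels": ["A", "B", "A"],
    "sample_labels": ["s1", "s2", "s1"],
    "counts": [10, 3, 7],
}

# Wrap every input in a numpy array before handing it to Load_Data.
inputs = {name: np.array(values) for name, values in raw_inputs.items()}

# With a DeepTCR object created as in the thread, the call would then be:
# DTCRU.Load_Data(**inputs)
```

When the columns come from a pandas DataFrame, `df["column"].to_numpy()` yields a numpy array directly, which avoids the intermediate list.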

choosehappy commented 4 years ago

yes! that totally did it, thanks

i had tried to mirror the "1-loading data" tutorial, but realize now when looking at the data quickly that numpy arrays and python lists appear similar at a quick glance. my mistake, sorry about that

you may want to add a note in that particular file, and some type checking in the loading function to help others

in particular, it's quite unexpected that the first 2 functions (loading + training), which i perceived as the "important" functions, work okay but the 3rd one (clustering) doesn't. i think that's what threw me off. i would say that more commonly, if the data isn't in the right format, the first or second command immediately fails with a more obvious error message
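One general reason a list can pass through early steps and only cause trouble later is that lists and numpy arrays behave differently under comparison and indexing. The sketch below is plain numpy with made-up labels; it is not a trace of DeepTCR's internals, and whether this is the exact failure path in this issue is an assumption.

```python
import numpy as np

class_labels_list = ["tumor", "normal", "tumor"]
class_labels_arr = np.array(class_labels_list)

# Comparing a list to a string gives a single bool, not a per-element result.
list_result = class_labels_list == "tumor"

# Comparing a numpy array to a string gives an elementwise boolean mask.
arr_result = class_labels_arr == "tumor"

# The mask can select rows of another array; the single bool cannot do this job.
features = np.arange(6).reshape(3, 2)
selected = features[arr_result]

print(list_result)
print(arr_result)
print(selected)
```

Code that only iterates over the inputs works with either type, while code that builds masks like `arr_result` depends on array semantics, which would be consistent with loading and training succeeding while a later step fails.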

anyway, very minor comments, thanks for all your help!

On Fri, Nov 15, 2019 at 4:45 PM John-William Sidhom < notifications@github.com> wrote:

figured it out. the problem is you're passing lists to the Load_Data where in the docs, it says you need to pass numpy arrays


sidhomj commented 4 years ago

for sure! I added some data type checking into the Load_Data function to make sure inputs are numpy arrays, thanks!
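The check that was actually added to Load_Data is not shown in this thread. As a rough sketch of what such a guard could look like, assuming a hypothetical helper named `require_numpy_arrays` that is not part of DeepTCR:

```python
import numpy as np

def require_numpy_arrays(**named_inputs):
    """Raise TypeError naming every non-None input that is not a numpy array."""
    bad = [
        name
        for name, value in named_inputs.items()
        if value is not None and not isinstance(value, np.ndarray)
    ]
    if bad:
        raise TypeError(
            "Expected numpy arrays for: "
            + ", ".join(sorted(bad))
            + ". Wrap list inputs in np.array()."
        )

# Arrays (and omitted inputs passed as None) go through silently.
require_numpy_arrays(beta_sequences=np.array(["CASSLGQAYEQYF"]), counts=None)

# A plain list is rejected up front with a message naming the argument.
try:
    require_numpy_arrays(beta_sequences=["CASSLGQAYEQYF"])
except TypeError as err:
    message = str(err)
    print(message)
```

Failing at load time with the argument name in the message addresses the point raised above about errors surfacing only at the clustering step.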

choosehappy commented 4 years ago

beautiful, and thanks again for the help!

On Sat, Nov 16, 2019 at 2:00 PM John-William Sidhom < notifications@github.com> wrote:

for sure! I added some data type checking into the Load_Data function to make sure inputs are numpy arrays, thanks!
