theislab / scarches

Reference mapping for single-cell genomics
https://docs.scarches.org/en/latest/
BSD 3-Clause "New" or "Revised" License

problems about scHPL training #212

Closed YawnC closed 1 year ago

YawnC commented 1 year ago

Hi~ I was trying to complete the reference creation process. However, when I used the provided code (shown in the following image), I got an error saying "PCA does not support sparse input. See TruncatedSVD for a possible alternative". Then I realized it may be due to the matrix I included, which comprises 5000 genes. [image] So I switched the matrix to the latent space (10 dims), like this: [image] which is similar to the demo data: [image] Then I got another error: [image] Any suggestions?

lcmmichielsen commented 1 year ago

I think the error comes from the slight difference in input to the train_tree function. The train_tree function uses the X matrix as input instead of the AnnData object (so source_adata.X instead of source_adata).

Hope this helps! If this doesn't solve the problem, could you provide the complete trace-back and your input, so it's easier to debug?
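For readers hitting the same two errors, here is a minimal sketch of the idea above: pull the matrix out of the AnnData-like object and densify it if it is sparse before handing it to train_tree. The helper name `dense_matrix` and the `FakeSparse` stand-in are hypothetical and only serve the illustration; the sketch assumes the object exposes its matrix as `.X` and that a sparse matrix offers `.toarray()`, as scipy.sparse matrices do.

```python
from types import SimpleNamespace


class FakeSparse:
    """Hypothetical stand-in for a scipy.sparse matrix, for illustration only."""

    def __init__(self, rows):
        self._rows = rows

    def toarray(self):
        # Real sparse matrices return a dense NumPy array here.
        return [list(row) for row in self._rows]


def dense_matrix(adata_like):
    """Return `adata_like.X` as a dense matrix.

    Assumption: sparse matrices expose `.toarray()`; anything without
    that method is treated as already dense and returned unchanged.
    """
    X = adata_like.X
    return X.toarray() if hasattr(X, "toarray") else X


# Stand-in for source_adata; a real AnnData object would be used instead.
source_adata = SimpleNamespace(X=FakeSparse([[0, 1], [2, 0]]))

# Pass this matrix, not source_adata itself, as the data argument of train_tree.
X = dense_matrix(source_adata)
```

Densifying a 5000-gene matrix can use a lot of memory, so running this on a low-dimensional latent space (as tried above with 10 dims) is the lighter option.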

YawnC commented 1 year ago

Hi, thanks for replying. I found that the previous problem came from my annotation. I made each annotation level at a different clustering resolution, which means that cells from one cluster at a lower level of the hierarchy could cross over and belong to different clusters at a higher level. This made the tree 'dirty'. After re-annotation the problem was solved. However, a new problem has come up: when I continued to the next step, updating the hierarchy with a new dataset, the following error appeared: [image] I tried different datasets and the same problem appeared.
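A minimal sketch of how such a 'dirty' tree could be detected before training, assuming the annotations are available as two per-cell label lists (one per level). The function name and the example labels are hypothetical; the check simply reports every lower-level label whose cells carry more than one higher-level label.

```python
from collections import defaultdict


def find_conflicting_children(child_labels, parent_labels):
    """Return {child label: sorted parent labels} for children with >1 parent.

    In a clean hierarchy every child label maps to exactly one parent label,
    so the returned dict is empty.
    """
    parents = defaultdict(set)
    for child, parent in zip(child_labels, parent_labels):
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


# Hypothetical per-cell annotations at two resolutions.
fine = ["CD4 T", "CD4 T", "CD8 T", "B", "B"]
coarse = ["T cell", "T cell", "T cell", "B cell", "T cell"]

# The last "B" cell sits in the "T cell" cluster, so "B" has two parents.
conflicts = find_conflicting_children(fine, coarse)
```

Any label reported here would need re-annotation (or splitting) so that each lower-level cluster nests inside a single higher-level cluster.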

YawnC commented 1 year ago

And here is the full report, thank you again! [image] [image]

lcmmichielsen commented 1 year ago

Hmm, interesting. The error occurs when the tree trained on data_2 (the query data) is used to predict the labels of data_1 (the reference data). When doing pca.transform(test_data), it seems that there are no cells in the test data, which causes the error. Is query_latent the combined latent space of the reference and query? And if so, are the labels of the reference exactly called 'reference'? You can have a look at this notebook to see an example of how to concatenate the reference and query data.
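Both questions can be checked with a quick sanity test before updating the hierarchy. The sketch below assumes the per-cell batch labels are available as a plain list and that the batches are expected in a given order; `missing_batches` and the example lists are hypothetical names used only for illustration.

```python
from collections import Counter


def missing_batches(batch_labels, batch_order):
    """Return the batches from `batch_order` that have no cells at all.

    Matching is exact, so 'Reference' or 'reference ' would not count
    as 'reference'.
    """
    counts = Counter(batch_labels)
    return [batch for batch in batch_order if counts[batch] == 0]


expected = ["reference", "query"]

# Latent space containing only query cells: the reference batch is empty.
query_only = ["query"] * 4
# Combined latent space of reference and query cells.
combined = ["reference"] * 3 + ["query"] * 4

missing_in_query_only = missing_batches(query_only, expected)
missing_in_combined = missing_batches(combined, expected)
```

A non-empty result means one dataset would contribute zero cells, which is consistent with the empty test data seen in the error.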

YawnC commented 1 year ago

Yes, I walked through this notebook previously, and it worked well with such a "one-level-annotation" (as shown below), with exactly the same datasets. [image]

However, when I switched to a "multi-level-annotation-tree" (shown below, following this notebook: https://github.com/lcmmichielsen/treeArches-reproducibility/blob/main/Figure2-HLCA%20healthy/Figure2%2C%20S9-S13.ipynb), this problem appeared. [image]

lcmmichielsen commented 1 year ago

What exactly do you mean by "this problem"? Is it related to the problem you mentioned before about the zero samples in the test data? Or is that solved, and is your problem related to the figure you attached now?

YawnC commented 1 year ago

The "0 sample (0,30)" problem concerns the "multi-level-annotation-tree", constructed as in the picture here: https://user-images.githubusercontent.com/118878017/270230240-2586f2dd-dd4e-4f54-bf13-fd5cf05231d8.png

What I mean is that when I abandon the multi-level annotation structure above and use only the lowest level instead, the model trains without errors, so I guess the problem may come from the hierarchy structure?

lcmmichielsen commented 1 year ago

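Good to know. Did you check the two things I mentioned earlier: whether query_latent is the combined latent space of the reference and query, and whether the labels of the reference are exactly called 'reference'?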

YawnC commented 1 year ago

Oh, actually not. query_latent contains only the query dataset, as I followed the GitHub reproducibility notebook. The reference label is 'reference', though. I will try full_latent first and then report the result to you. Thank you for your patient and generous help!

lcmmichielsen commented 1 year ago

Okay, let me know whether this helps!

Btw, in code block 14 of the notebook you mentioned (https://github.com/lcmmichielsen/treeArches-reproducibility/blob/main/Figure2-HLCA%20healthy/Figure2%2C%20S9-S13.ipynb), we also merge the reference (LCA) and query (emb_M) into one object before updating the hierarchy. That gives you another example of how you could implement it for your dataset.

YawnC commented 1 year ago

Hi, here is an update on my issue: I moved the Jupyter file into VS Code and picked out learn.py from the package to run as a subprocess. Here is the debugging result: [image] The variables data_1, data_2 and trees seem fine, but the problem is still there: [image]

I know it may be hard to figure out what is going on inside, as the dataset varies, so if it is too much trouble, just ignore my issue and close it. Thank you again!

lcmmichielsen commented 1 year ago

Hmm, this is weird. Your code now crashes at a different spot, right? It used to be at labels_1_pred = predict_labels(data_1, tree_2, threshold=rej_threshold), but now it's a step earlier, during tree = train_tree(data_1, labels_1, tree, classifier, dimred, useRE, FN, n_neighbors, dynamic_neighbors, distkNN), right?

Do the labels you input to the learn_tree function still correspond to the labels that were already in the hierarchy?

It's quite difficult to debug, so without a proper error traceback for this new problem and your input code, I am afraid I cannot help you.

YawnC commented 1 year ago

Thank you again! I will use the demo dataset instead.