lareaulab / psix

14 stars 7 forks source link

AssertionError during psix_object.run_psix #8

Closed cecileherbermann closed 1 year ago

cecileherbermann commented 1 year ago

Hi, I am using psix for smart-seq2 data that is aligned using STAR. I've successfully ran psix_object.junctions2psi and psix_object.run_psix is partly working, but gets an assertionerror in line 247 of score_functions.py.

These are the last lines of the output after which the assertionerror occurs: Successfully computed Psix score of exons. Estimating p-values. This might take a while...

Do you know why this error occurs? Sidenote: the output from psix_object.junctions2psi has three files: the psi table, mrna table and splice_junctions table. My psi table and splice_junctions table are filled, but the mrna table has no values (except for the column and row names). The input that I use are the SJ.out.tabs from STAR, a TPM matrix with the transcript names as rows and cell names as columns, based on the extracted TPM columns from the RSEM .isoforms.results file. Furthermore, I use the Cassette exon annotation that you have provided. If you want, I can send you the data or any other additional information.

Kind regards,

Cecile

image
cfbuenabadn commented 1 year ago

The mRNA matrix should not be empty. The error that you're getting occurs because there's a problem when creating the bins for estimating the empirical p-values. My guess is that both issues are related.

You should make sure that the gene names in the TPM matrix match the names in the junction annotation that you're using. If you're using the human or mouse annotations that we provided with Psix, the names should be gene symbols (NOVA1, PTBP2, etc in human; Nova1, Ptbp2, etc in mouse). For example, if your TPM matrix has ensembl IDs (ENSG000 in human or ENSMUSG000 in mouse), and you're using our annotation, you are likely to run into errors. If this is the source of the error, I'd recommend using a tool such as biomart to convert your TPM gene names to gene symbols.(https://gseapy.readthedocs.io/en/latest/gseapy_example.html#Biomart-API)

You should also, the cell IDs in your TPM matrix are identical to the names in the junction reads matrix. Both matrices should have the cell IDs as columns (not rows).

If you made sure that all of this is correct and you're still getting this error, please reopen the issue and I'll be happy to help.

cecileherbermann commented 1 year ago

Thank you very much for your quick and clear response (and of course for the development of this tool). The TPM matrix indeed has ensembl transcript IDs, and I will adjust this to the gene symbols used in the junction annotation. I will close this issue for now until I have resolved it. If I still run into the same error, I will reopen the issue. Thanks again, I really appreciate your willingness to help.

cecileherbermann commented 1 year ago

Dear Carlos,

I have made sure that the Ensembl Transcript IDs match in all the input files. Using your program GTF2psix (thank you very much for updating the code and helping me out, I am very grateful for all the help you have provided), I have created an annotation with the entire Ensembl annotation with the code you suggested: python sc_splicing_tools/GTF2psix.py --gtf Homo_sapiens.GRCh38.109.gtf.gz -o psix_annotation --gene_type_tag gene_biotype --transcript_type_tag transcript_biotype

Furthermore, the chromosome names match the annotation, but I still get the same AssertionError that I got before. I don't know what creates the error, these are all the input files that I used: https://we.tl/t-HvKDIUhwOC Unfortunately, they are too big to add as an attachment, but if you would like to receive them in another way, please let me know.

The error looks exactly the same as the error I got in the first message I posted:

image

I am looking forward to any suggestions about what could have caused this issue, thank you very much in advance!

Kind regards,

Cecile

cfbuenabadn commented 1 year ago

I'm sorry you're still having this issue. I will look into it and I will get back to you.

cfbuenabadn commented 1 year ago

Hello, I updated Psix to automatically remove cells that have all NaN or inf values. If you install the latest version of Psix (currently 0.10.6), it should work. I ran this version with your data, and it worked.

I got 82 exons with a FDR < 0.05 (101 with FDR < 0.1). I also ran psix_object.compute_modules(min_gene_threshold=15, n_neighbors=30) and I got two modules (although these are quite small).

I tracked down the error to the mrna_per_event matrix. There are some cells that either have NaN in all entries (i.e., all their values for this matrix are missing), or have NaN and inf entries only. This was causing the problem. These seem to be cells with very shallow coverage (the inf seems to be caused during the TPM --> mRNA counts step in very shallow cells). In your data in particular, there were many cells like this. After removing them, I was left with 495 cells.

I will close the issue now. If you have any questions or any more problems running Psix, please let me know. This feedback is very valuable, because it helps me discover edge cases where the program fails.

Thank you, Carlos