splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data
16 stars 4 forks source link

v1.0.3 crashing without a note in log #30

Closed sanyalab closed 3 months ago

sanyalab commented 1 year ago

Hi,

I am running psiclass v1.0.3. The tool runs fine for many datasets. Only sometimes it fails without throwing an error. I have followed the installation as mentioned in the README.

Tail of the log file is as below Thread 10: Chr04B 17383812 17423435 finished. Thread 13: Chr15B 8433979 8514831 finished. Thread 6: Chr06B 11481451 11510381 finished. Thread 16: Chr30B 6353608 6565124 finished. Thread 14: Chr04B 9067830 9103488 finished. Thread 7: Chr21B 12043077 12160244 finished. Thread 8: Chr30B 6579940 6889886 finished. Thread 17: Chr05B 9204076 9354827 finished. Thread 12: Chr16B 8151279 8203086 finished. Thread 5: Chr06B 16478464 16603194 finished.

Tail of the STDERR subexon-info egg_late.bam ./splice/ECB_Hap2_psic_bam_0.splice > ./subexon/ECB_Hap2_psic_subexon_0.out subexon-info female.bam ./splice/ECB_Hap2_psic_bam_1.splice > ./subexon/ECB_Hap2_psic_subexon_1.out subexon-info gut_instar_3.bam ./splice/ECB_Hap2_psic_bam_2.splice > ./subexon/ECB_Hap2_psic_subexon_2.out subexon-info instar_1.bam ./splice/ECB_Hap2_psic_bam_3.splice > ./subexon/ECB_Hap2_psic_subexon_3.out subexon-info instar_3.bam ./splice/ECB_Hap2_psic_bam_4.splice > ./subexon/ECB_Hap2_psic_subexon_4.out subexon-info instar_5.bam ./splice/ECB_Hap2_psic_bam_5.splice > ./subexon/ECB_Hap2_psic_subexon_5.out subexon-info male.bam ./splice/ECB_Hap2_psic_bam_6.splice > ./subexon/ECB_Hap2_psic_subexon_6.out subexon-info pupa.bam ./splice/ECB_Hap2_psic_bam_7.splice > ./subexon/ECB_Hap2_psic_subexon_7.out combine-subexons --ls ./subexon/ECB_Hap2_psic_subexon.list > ./subexon/ECB_Hap2_psic_subexon_combined.out classes --stranded rf -p 20 --lb BamFile_genomeB.path -s ./subexon/ECB_Hap2_psic_subexon_combined.out -o ./ECB_Hap2_psic > ./ECB_Hap2_psic_classes.log

I cannot share bam files for this dataset and I am not sure whether I can recreate the error with others since the error is unpredictable.

Thanks Abhijit

mourisl commented 1 year ago

Can you share the log file with me? Thank you.

sanyalab commented 1 year ago

Attached are the log and the STDERR files ECB_Hap2_psic_classes.log STDERR.txt

mourisl commented 1 year ago

Thank you for sharing the file. It seems there are three gene regions that hasn't finished Chr16B 4741698 4760443. Chr20B 13708701 13730914. Chr30B 4295343 4490726. Do you receive any error message from PsiCLASS, or the job gets terminated due to long running time?

You can run PsiCLASS with a lower value of --maxDpConstraintSize to make it faster. During the test, you can also add the option --stage 3 to reuse the subexon structures from previous step. Another approach is to see whether there are some strange introns in these region, and use a more stringent threshold to filter the introns.

sanyalab commented 1 year ago

Hi, I do not get any error message so I do not know how long the program ran. But this is interesting. Essentially I am trying to build a set of transcripts on each haplotype. Haplotypes are not much different from each other. Therefore, I wonder why it fails on one, but succeeds in the other one.

How does one give an intron threshold in PsiClass? Can you tell me how to get a detailed help for psiclass?

I'll start the run once I hear back from you

Thanks a lot Abhijit

mourisl commented 1 year ago

To filter introns, you can use the option "--sa" to specify the minimum average number of reads supporting an intron. You can try a higher value for the filter.

How many samples do you have when running PsiCLASS?

sanyalab commented 1 year ago

Typically I will have 8 - 10 bam files to work with.

By intron threshold I meant if there was any method to specify length of introns or any way to specify the minimum distance between read clusters.

mourisl commented 1 year ago

The long intron size is inferred by the 99% quantile of all the intron lengths, so if you reduce this value too much, you may miss some real introns. I think trying a smaller value for --maxDpConstraintSize, say 5, could be better.

I think you can check the splice/*.splice file to confirm that there are some introns should be filtered.

sanyalab commented 1 year ago

What does "--maxDpConstraintSize" do in general? I didn't quite follow. I'll check the splice file.

Thanks Abhijit

mourisl commented 1 year ago

It kind of controls how many subexons a read can be partially aligned with. In a clean gene, a read probably only spans two or three subexons, so 5 or the default value 7 is a quite safe choice in general.

sanyalab commented 1 year ago

Ah I see. Can you tell me what the fields are in the splice file? Or point me to the documentation? Could not find it.

Chr01B 198 12089 185 - 185 0 59 0 Chr01B 200 12222 4 - 4 0 12 0 Chr01B 228 12151 2 - 2 0 9 0 Chr01B 393 12418 12 - 11 1 27 2 Chr01B 436 12321 97 - 97 0 6 0 Chr01B 4678 6745 86 - 14 72 27 41

Thanks Abhijit

mourisl commented 1 year ago

The fields are chrom_name start_site end_site number_of_read_support strand other_support_information(e.g. unique alignment, number of mismatches).

sanyalab commented 1 year ago

I see.... A few additional questions

Q1: Just to confirm, the start_site and end_site are the coordinates of each intron and the "number_of_read_support" is the number of reads that span the intron. Is that Correct?

Q2: In the example below, Number_of_unique_alignments = 185, Number_of_mismatches = 0 or 59? Chr01B 198 12089 185 - 185 0 59 0

Q3: If I filter the splice file based on intron length, it will effect the subexon graph and consequently the transcript models right? In this case do I start from stage 2?

Best Regards and thank you for the prompt replies.

-Abhijit

mourisl commented 1 year ago

Q1: Yes. To be exact, start_site is the end of the left exon, and end_site the start of the right exon.

Q2: 59 is the number of mismatch from the unique alignment. The two 0's are for multi-mapped reads

Q3: You need to start from stage 1.

sanyalab commented 1 year ago

Hello,

I used the following command to rerun PSICLASS

nohup ~/SOFTWARE/PsiCLASS/psiclass --lb /data/sanyalab/TESTAREA/BamFile_genomeB.path --maxDpConstraintSize 5 --stranded rf -o ECB_Hap2_psic --stage 3 -p 20 &

I got the same result. The program crashes/quits/stops without a error report. Attached are the log files nohup.txt ECB_Hap2_psic_classes.log

Also attaching a screenshot of the files created. All the gtf files are empty. Ignore the "PREV" directory Capture8

Thanks Abhijit

mourisl commented 1 year ago

Could you please share the file subexon/abc_subexon_combined.out file and the header of the bam file ("samtools view -H xyz.bam") with me?

JohnUrban commented 3 months ago

Was this ever solved?

sanyalab commented 3 months ago

Hi John,

I never got to spend time on this, and moved away onto other things. If the situation occurs again, I'll reopen the issue.

Thanks -Abhijit