what is the meaning of genome-guided transcriptome assembly

Xiaofei-git commented 4 years ago

Dear there,

I see CLASS is genome-guided transcriptome assembly. What is the meaning of that? My question is CLASS a de novo assembler or annotation-based methods?

If it is annotation-based method, which option should I use to give the path for reference transcript annotation to guide the assembly? I see there is an option of "-e evidence: the path to the evidence files". I tired to give the annotation file path using this option, but the output still looks like de novo assembly, as below, not incorporate the known annotation information. I am working on Sorghum, I am expecting the gene_id and transcript_id are something like 'gene_id "gene:SORBI_3001G000100"; transcript_id "transcript:EER90453";'.

Another thing is that it takes so long to run, my job has been run for 7 days and it is still running on cluster with 12 threads. My bam file is large, which is about 0.5 T, but it took about 2 days using Stringtie on the same bam file. Any ideas about my running CLASS job?

Thanks a lot!

1 CLASS transcript 11356 14772 1000 - . gene_id "1.0"; transcript_id "1.0.0"; Abundance "2.783007"; 1 CLASS exon 11356 11531 1000 - . gene_id "1.0"; transcript_id "1.0.0"; exon_number "1"; Abundance "2.783007" 1 CLASS exon 11649 11732 1000 - . gene_id "1.0"; transcript_id "1.0.0"; exon_number "2"; Abundance "2.783007" 1 CLASS exon 11892 12152 1000 - . gene_id "1.0"; transcript_id "1.0.0"; exon_number "3"; Abundance "2.783007"

mourisl commented 4 years ago

Genome-guided means it uses the alignment of the reads to assemble the transcripts. Genome here means reference genome where the reads aligned to. CLASS is a de novo assembler in general sense.

CLASS2 processes the data chromosome by chromosome, and it will output the results when one chromosome finishes, maybe you can check the output gtf file and see which stage it is at.

If you find CLASS2 is too slow to wait, you can try the reimplemented version of CLASS in https://github.com/splicebox/PsiCLASS . PsiCLASS can still be used for single-sample transcriptome assembly, and the speed should be comparable to StringTie's.

Xiaofei-git commented 4 years ago

Genome-guided means it uses the alignment of the reads to assemble the transcripts. Genome here means reference genome where the reads aligned to. CLASS is a de novo assembler in general sense.

CLASS2 processes the data chromosome by chromosome, and it will output the results when one chromosome finishes, maybe you can check the output gtf file and see which stage it is at.

If you find CLASS2 is too slow to wait, you can try the reimplemented version of CLASS in https://github.com/splicebox/PsiCLASS . PsiCLASS can still be used for single-sample transcriptome assembly, and the speed should be comparable to StringTie's.

Thank you so much! I am trying to run class2 on our sub-samples, and I will let you know how it is going.

Xiaofei-git commented 4 years ago

Genome-guided means it uses the alignment of the reads to assemble the transcripts. Genome here means reference genome where the reads aligned to. CLASS is a de novo assembler in general sense.

CLASS2 processes the data chromosome by chromosome, and it will output the results when one chromosome finishes, maybe you can check the output gtf file and see which stage it is at.

If you find CLASS2 is too slow to wait, you can try the reimplemented version of CLASS in https://github.com/splicebox/PsiCLASS . PsiCLASS can still be used for single-sample transcriptome assembly, and the speed should be comparable to StringTie's.

You said CLASS2 processes data chromosome by chromosome. My question is does the number of chromosomes affect the speed of class2 to assemble? I work on sorghum which has 10 chromosomes, but there are also supercontigs in my reference which was used to do alignment for bam file. The total number of sequences is 867 in my reference. My second question is about PsiCLASS, if I use it, does it compatible for Mikado for downstream analysis? Thanks a lot!

mourisl commented 4 years ago

The number of chromosomes would not affect the speed of class2. The major factor affecting the speed the complexity of a gene. If a gene structure gets too complicated, such as due to alignment artifact, the intron splice sites is everywhere for a gene, class2 will take a lot of time to compute.

As for PsiCLASS, it seems that Mikado only requires GTF files as input, so I think the output of PsiCLASS's gtf files would work for Mikado.

Xiaofei-git commented 4 years ago

The number of chromosomes would not affect the speed of class2. The major factor affecting the speed the complexity of a gene. If a gene structure gets too complicated, such as due to alignment artifact, the intron splice sites is everywhere for a gene, class2 will take a lot of time to compute.

As for PsiCLASS, it seems that Mikado only requires GTF files as input, so I think the output of PsiCLASS's gtf files would work for Mikado.

I am trying to install PsiLASS. But, I got error after running "make". Here is the error as below. Actually, after "git clone", I see the executable "psiclass" in the downloaded directory even WO running "make". My question is could I use it directly without running "make"? We see the psiclass executable file is suspiciously small? Is it related to operating system to make successfully run?

.... gcc -c -g -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_USE_KNETFILE -D_CURSES_LIB=0 -I. bam_tview_html.c -o bam_tview_html.o gcc -g -Wall -O2 -o samtools bam_tview.o bam_plcmd.o sam_view.o bam_rmdup.o bam_rmdupse.o bam_mate.o bam_stat.o bam_color.o bamtk.o kaln.o bam2bcf.o bam2bcf_indel.o errmod.o sample.o cut_target.o phase.o bam2depth.o padding.o bedcov.o bamshuf.o bam_tview_curses.o bam_tview_html.o libbam.a -Lbcftools -lbcf -lm -lz -lpthread phase.c:475: error: undefined reference to 'gzopen64' bedcov.c:69: error: undefined reference to 'gzopen64' bam_import.c:76: error: undefined reference to 'gzopen64' bam_import.c:126: error: undefined reference to 'gzopen64' collect2: error: ld returned 1 exit status make[2]: [samtools] Error 1 make[2]: Leaving directory `/mnt/grid/ware/hpc/home/data/xwang/software/PsiCLASS/PsiCLASS/samtools-0.1.19' make[1]: [all-recur] Error 1 make[1]: Leaving directory `/mnt/grid/ware/hpc/home/data/xwang/software/PsiCLASS/PsiCLASS/samtools-0.1.19' make: *** [subexon-info] Error 2

Xiaofei-git commented 4 years ago

Actually, I tried to run without 'make", but it turned out it did 't work out. It complained "sh: ~/PsiCLASS/PsiCLASS/samtools-0.1.19/samtools: No such file or directory".

So, it has to run ''make' successfully.

Do you know why I got that error when I run "make"?

Xiaofei-git commented 4 years ago

Our cluster admin helped me changed the Makefile, so it works now. But, we still don't know why it complains it can't find the zlib (libz.so). Thanks!

mourisl commented 4 years ago

I think zlib usually comes with the system. If you are using on a SLURM cluster, sometimes changing the version of gcc could load the right zlib back. CLASS2 also requires that library to compile samtools-0.1.19, not sure why it worked at time.

Xiaofei-git commented 4 years ago

I think zlib usually comes with the system. If you are using on a SLURM cluster, sometimes changing the version of gcc could load the right zlib back. CLASS2 also requires that library to compile samtools-0.1.19, not sure why it worked at time.

Because we also hacked the makefile for CLASS2.

Xiaofei-git commented 4 years ago

Could you tell me in brief the difference between CLASS2 and PsiCLASS? Why the later one is so fast. Any difference between the results? I did some research. It showed me that PsiCLASS simultaneously assembles multiple RNA-seq samples, which significantly improves performance over the traditional ‘assemble-and-merge’ model. But, it is still unclear to me. I run both of them on my merged bam, which seems one sample, right? I run both CLASS2 and PsiCLASS on my subsample data (merged bam from subsamples). It is really fast to finish the assembly with PsiCLASS, only took "wallclock 12601.337, cpu 16938.461". While, the CLASS2 job is still running, and has already taken "wallclock=2:17:42:07, cpu=10:18:02:36", and I only got the assembly for chromosome1.

Thanks a lot!

mourisl commented 4 years ago

You are right about that. If you gave it as single large bam file, it would regard it as a one sample case.

CLASS2 has an enumeration due to the insert between paired-end reads, hence it is very slow if there are too much false introns. PsiCLASS first regards the paired-end read as single-end to avoid the enumeration, and in later stage uses weight to prefer the transcripts satisfying the paired-end relation. From my experience, the performance is very close.

Xiaofei-git commented 4 years ago

You are right about that. If you gave it as single large bam file, it would regard it as a one sample case.

CLASS2 has an enumeration due to the insert between paired-end reads, hence it is very slow if there are too much false introns. PsiCLASS first regards the paired-end read as single-end to avoid the enumeration, and in later stage uses weight to prefer the transcripts satisfying the paired-end relation. From my experience, the performance is very close.

Thanks a lot for your help! There is another question coming up to my mind. You said that it is very slow if there are too much false introns. My question is if my CLASS2 is very slow, do you think is it suggesting there are a lot of false introns in my assembly? Where is the false introns coming from, because of the false alignment and false junctions? How can I detect and remove them?

mourisl commented 4 years ago

Usually those introns will be lowly supported and the chosen transcripts have that intron are usually filtered out later. But CLASS2 will spend a lot of effort when picking the transcripts since they could from low-abundance transcript. Usually they are alignment artifacts, for example, for some genes CLASS2 stuck on, you may observe intron splice site at almost every base or each splice is associate with too many introns. You can manually change the "if ( splices[k].support <= 0.01 * maxUniqSupport )" at line 1838 in FindRegions.cpp to 0.05 to filter more splice site.

It just come to my mind that since you are aligning to a plant genome, is their splice motif such as GT-AG in human to determine the splice site and strand? And what aligner did you use?

Xiaofei-git commented 4 years ago

Yes, there is splice motifs in plants too, I remember the common one is also GT-AG. I used STAR to do the alignment.

mourisl commented 4 years ago

By the way, the manuscript for PsiCLASS was officially published at here

mourisl / CLASS

what is the meaning of genome-guided transcriptome assembly #1