splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data

Read quality scores, indels and substitution #15

Open sagnikbanerjee15 opened 3 years ago

sagnikbanerjee15 commented 3 years ago

Hello,

I was reading the methods for splice graph construction and was wondering whether PsiCLASS uses quality scores at any point. Also, are indels, substitutions and soft-clippings used? I was thinking of storing only the information that PsiCLASS needs and discarding the rest, which might save a lot of space. I am currently designing an alignment file format that stores only the bare minimum of information necessary for generating an assembly.

Thank you.

mourisl commented 3 years ago

Do you mean to save space in the BAM file? I don't think I used the quality score in PsiCLASS. However, indels and soft-clipping in the CIGAR string are quite important for determining the exact boundaries of a read and the subexons it covers. As for the optional fields in the BAM file, I think NM (edit distance to the reference, roughly the number of mismatches) and NH (number of hits, used to identify multiple-aligned reads) are also used in PsiCLASS.
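To illustrate why the CIGAR string matters for read boundaries, here is a minimal Python sketch that walks a CIGAR string and reports the reference intervals a read covers, splitting at introns (`N`). This is not PsiCLASS code; the function names `reference_blocks` and `int_tag` are hypothetical, and the operator semantics are the standard ones from the SAM format.

```python
import re

# One CIGAR element: a length followed by an operator.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def reference_blocks(pos, cigar):
    """Return (blocks, (left_clip, right_clip)).

    pos    : 1-based leftmost mapped position (SAM POS field)
    blocks : 1-based inclusive reference intervals, split at N (introns)
    """
    blocks = []
    start = pos          # start of the block currently being extended
    cur = pos            # next reference position to be consumed
    left_clip = right_clip = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=XD":         # these operators consume the reference
            cur += n
        elif op == "N":          # intron: close the block, skip ahead
            if cur > start:
                blocks.append((start, cur - 1))
            cur += n
            start = cur
        elif op == "S":          # soft clip: consumes the read only
            if cur == pos and not blocks:
                left_clip += n
            else:
                right_clip += n
        # I, H and P consume no reference bases and do not move the boundary
    if cur > start:
        blocks.append((start, cur - 1))
    return blocks, (left_clip, right_clip)

def int_tag(optional_fields, tag, default=None):
    """Look up an integer tag such as NM or NH in SAM optional fields."""
    prefix = tag + ":i:"
    for field in optional_fields:
        if field.startswith(prefix):
            return int(field[len(prefix):])
    return default

blocks, clips = reference_blocks(100, "5S20M1D10M500N30M3S")
print(blocks, clips)
print(int_tag(["NM:i:1", "NH:i:2"], "NH"))
```

Dropping the deletion or the intron from that CIGAR string would shift the end of the read on the reference, which is why these operators cannot be discarded even in a lossy format.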

sagnikbanerjee15 commented 3 years ago

Yes, I wish to save space, and I think there might be a better way to do lossy compression without losing relevant information. I foresee a need to store alignment files so that reads do not have to be realigned while the essential information is still preserved. All I am attempting to do is compress and decompress as quickly as possible. Can PsiCLASS be modified to accept an alignment file in a format other than BAM?

Thanks.

mourisl commented 3 years ago

So far, it only supports the BAM format. Have you checked the CRAM format? I think I can update PsiCLASS to support CRAM.

sagnikbanerjee15 commented 3 years ago

Yes, I looked into CRAM and I think there is a way to reduce storage even further. Does PsiCLASS use the seq field?

Thanks.

mourisl commented 3 years ago

Yes, it does. It uses the sequence field to determine GC content and, in some places, to determine the read length. You can set the bases to all "A"s to save space; the GC bias correction should not affect the performance much.
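The suggestion above could be sketched in Python roughly as follows, operating on SAM text lines. This is an illustration under assumptions, not a tested PsiCLASS workflow: the function name `strip_sam_record` is hypothetical, the default set of kept tags (NM and NH) comes only from this thread, and whether PsiCLASS tolerates a QUAL field of `*` has not been verified here.

```python
def strip_sam_record(line, keep_tags=("NM", "NH")):
    """Return a reduced copy of one SAM line.

    - header lines (starting with '@') are passed through unchanged
    - SEQ is replaced by 'A' repeated to the same length, so the read
      length still agrees with the CIGAR string
    - QUAL is replaced by '*' (allowed by the SAM format when qualities
      are not stored)
    - only the optional tags named in keep_tags are retained
    """
    if line.startswith("@"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    seq = fields[9]
    if seq != "*":
        fields[9] = "A" * len(seq)   # keeps the length, drops the bases
    fields[10] = "*"                 # drop base qualities
    kept = [f for f in fields[11:] if f[:2] in keep_tags]
    return "\t".join(fields[:11] + kept) + "\n"

record = (
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\t"
    "NM:i:0\tAS:i:4\tNH:i:1\n"
)
print(strip_sam_record(record), end="")
```

The reduced lines would still need to be converted back to BAM before being passed to PsiCLASS, and any additional tags the assembler turns out to rely on would have to be added to `keep_tags`.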

Have you checked Boiler, another compression method for RNA-seq alignments?

sagnikbanerjee15 commented 3 years ago

Yes, I am currently looking into Boiler.

Thanks.