sagnikbanerjee15 / Finder

A fully automated gene annotator from RNA-Seq expression data
MIT License
51 stars 14 forks source link

really long introns create a problem with gffread #53

Open jkreplak opened 2 years ago

jkreplak commented 2 years ago

Hi, My Finder crashed during FindCDS at the codan steps. Codan throw an error about duplicate key :

ValueError: Duplicate key 'chr6:1-94332053.99_10_covsplit.0'

After checking this error, i found in the gtf file combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf around 650 transcripts with ultra-long introns :

chr1    FINDER  transcript      46339202        72807906        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 
chr1    FINDER  exon    46339202        46340469        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 
chr1    FINDER  exon    72807879        72807906        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 

When I relaunch gffread, i can see that the software is creating at least two fasta entry for this transcript, explaining the duplicate message of codan. After checking the web, it seems that gffread has an intron limit size.

How can I go around that to finish the pipeline ?

Thanks, Jonathan

sagnikbanerjee15 commented 2 years ago

Hello @jkreplak,

Thank you so much for your interest in finder and thank you for reporting this issue. I have not encountered this issue before and I need to think of a solution. I will keep you posted.

Thank you.

sagnikbanerjee15 commented 2 years ago

Hello,

Could you please repost the question? For some reason, I cannot find it on Github.

Thank you. Sagnik Banerjee (Ph.D.) Iowa State University Applied Bioinformatics Scientist Bristol Myers Squibb

*"The moment I have realized God sitting in the temple of every human body, the moment I stand in reverence before every human being and see God in him

On Thu, Mar 24, 2022 at 12:58 PM jkreplak @.***> wrote:

Hi, I've managed to create with agat my own exon-based transcripts fasta and comment gffread to force finder to use it. It's finished, so that's a really good news ! However, I'm surprised by the results. Why does the CDS stop before the stop codon base and not at the end of it in the gtf ?

Is that a bug on my side ? If not, the official ontology for CDS in the SO include the stop codon and it should be added.

— Reply to this email directly, view it on GitHub https://github.com/sagnikbanerjee15/Finder/issues/53#issuecomment-1077892827, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFS7BOMHPJFF4FISDAJHSQ3VBSUMXANCNFSM5RQPMPFQ . You are receiving this because you were assigned.Message ID: @.***>

jkreplak commented 2 years ago

Hello, So I was able to modify findCDS command to use another tool (agat) to create fasta and it worked till the end. There is a few really large introns ( > 6Mb) caused by bad UTR splicing in some isoforms. I'll remove them by hand. I was just surprised by the final gtf format. As the CDS doesn't include stop codon and there are no stop codons lines for each transcript, tools (gffread, agat, jcvi...) are calling them 3' partial. It's difficult to work with the way I do it usually.

sagnikbanerjee15 commented 2 years ago

Hello @jkreplak,

Thank you for reporting this issue. I will create a program that will you can use to filter out certain genes and transcripts. Could you please post a few examples of the output where stop codons are missing? Since the result is being output in GTF mode, there will not be any STOP codon annotated. But later versions of the software will have the option to request for such annotations as well.

Thank you.