same gene_id in distant genomic locations

GrantHov commented 5 years ago

After running stringtie --merge, some transcripts from several distant genomic locations end up under the same gene_id. Below is an example (sorry if it's too long): gene_id MSTRG.11 is repeated many times across different transcripts.

I am using v1.3.6. My command is:

stringtie --merge -o calb_merged.gtf  -v -G ../../../ref_gen/C_alb_A.gff assemblies.txt

Ca22chr1A_C_albicans_SC5314 StringTie   transcript  12163   468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    12163   13751   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.1"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    468090  468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.1"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  12163   13701   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00060W_A-T"; gene_name "TUP1"; ref_gene_id "C1_00060W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    12163   13701   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00060W_A-T"; exon_number "1"; gene_name "TUP1"; ref_gene_id "C1_00060W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  12163   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.3"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    12163   12648   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.3"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13075   13816   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.3"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13868   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.3"; exon_number "3"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  12163   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.4"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    12163   12551   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.4"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13453   13816   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.4"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13868   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.4"; exon_number "3"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  12190   28401   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.5"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    12190   13760   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.5"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    27344   28401   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.5"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  13778   15033   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.6"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13778   15033   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.6"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  13778   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00070W_A-T"; gene_name "MVD"; ref_gene_id "C1_00070W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13778   13816   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00070W_A-T"; exon_number "1"; gene_name "MVD"; ref_gene_id "C1_00070W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    13868   14917   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00070W_A-T"; exon_number "2"; gene_name "MVD"; ref_gene_id "C1_00070W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  17338   18960   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00110W_A-T"; gene_name "CCT8"; ref_gene_id "C1_00110W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    17338   18960   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00110W_A-T"; exon_number "1"; gene_name "CCT8"; ref_gene_id "C1_00110W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  17338   28226   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.9"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    17338   19014   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.9"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    28217   28226   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.9"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  22270   25326   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.10"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    22270   23665   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.10"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    24524   25326   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.10"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  22270   28226   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.11"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    22270   25384   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.11"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    28217   28226   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.11"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  22270   25326   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00140W_A-T"; gene_name "KEL1"; ref_gene_id "C1_00140W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    22270   25326   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00140W_A-T"; exon_number "1"; gene_name "KEL1"; ref_gene_id "C1_00140W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  31842   468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.13"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    31842   32210   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.13"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    32361   32608   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.13"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    468090  468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.13"; exon_number "3"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  31842   469045  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.14"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    31842   32604   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.14"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    468090  469045  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.14"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  31842   32444   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00180W_A-T"; gene_name "RPL16A"; ref_gene_id "C1_00180W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    31842   32444   1000    +   .   gene_id "MSTRG.11"; transcript_id "C1_00180W_A-T"; exon_number "1"; gene_name "RPL16A"; ref_gene_id "C1_00180W_A"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  32248   468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.16"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    32248   32604   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.16"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    468088  468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.16"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  32248   515735  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.17"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    32248   32604   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.17"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    514393  515735  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.17"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  32248   514413  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.18"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    32248   32608   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.18"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    514393  514413  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.18"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  32290   468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.19"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    32290   32612   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.19"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    468090  468115  1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.19"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  41789   43509   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.20"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    41789   42755   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.20"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    43302   43509   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.20"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  42123   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.21"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    42123   42710   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.21"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    46115   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.21"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  42155   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.22"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    42155   42678   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.22"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    46067   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.22"; exon_number "2"; 
Ca22chr1A_C_albicans_SC5314 StringTie   transcript  42155   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.23"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    42155   42408   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.23"; exon_number "1"; 
Ca22chr1A_C_albicans_SC5314 StringTie   exon    45708   48359   1000    +   .   gene_id "MSTRG.11"; transcript_id "MSTRG.11.23"; exon_number "2";
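To see which transcripts in a merged GTF like the listing above span unusually long gaps between exons, a small script can help. This is only a sketch: the function name `largest_introns` and the 10,000-base default threshold are assumptions made for illustration, not part of StringTie, and the script expects a tab-separated GTF as written by StringTie.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def largest_introns(gtf_lines, min_intron=10_000):
    """Return {transcript_id: largest intron length} for every transcript
    whose largest intron is at least min_intron bases.

    min_intron is an arbitrary illustrative default; choose a value that
    makes sense for the organism being studied.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        exons[tid].append((int(fields[3]), int(fields[4])))

    result = {}
    for tid, spans in exons.items():
        spans.sort()
        # Intron length = bases strictly between two consecutive exons.
        gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(spans, spans[1:])]
        if gaps and max(gaps) >= min_intron:
            result[tid] = max(gaps)
    return result
```

It could be called as `largest_introns(open("calb_merged.gtf"))`; the transcripts it reports are the ones that tie otherwise separate loci into a single gene_id.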

GrantHov commented 4 years ago

Any update on this issue?

gpertea commented 4 years ago

By "distant" you mean ~450 kb apart? I guess that may be distant for Candida albicans, but not for a mammalian genome. It's hard to see what went wrong by looking at the "merged" output of many samples: a few spurious read alignments in one sample can ruin a "locus" for the rest of them. I see quite a few transcripts there with a relatively large "intron", e.g. MSTRG.11.1, MSTRG.11.13, MSTRG.11.18. Those likely come from read alignments in one or more samples where the aligner decided that a ~450 kb intron was acceptable and was the best way it could align those reads. If that intron is too large for your organism, I think you should limit the maximum intron size allowed during alignment (hopefully the aligner you used has such an option) or, less preferably, filter such alignments out of your BAM file.
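Which option limits intron size depends on the aligner, which the thread does not name; HISAT2, for example, has `--max-intronlen` and STAR has `--alignIntronMax`. For the less preferred route of filtering existing alignments, here is a minimal sketch that works on SAM text (for instance the output of `samtools view -h`). The function names and the threshold are assumptions for illustration. It drops any alignment whose CIGAR string contains an `N` (skipped region) operation longer than the chosen limit.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def max_intron_in_cigar(cigar):
    """Length of the longest N (skipped region) operation, or 0 if none."""
    return max((int(n) for n, op in CIGAR_RE.findall(cigar) if op == "N"),
               default=0)


def filter_sam(lines, max_intron):
    """Yield header lines and alignments whose longest N is <= max_intron.

    Lines are passed through unchanged; records with fewer than six
    fields are kept as-is rather than guessed at.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 6 or max_intron_in_cigar(fields[5]) <= max_intron:
            yield line
```

One caveat with this approach: for paired-end data, removing one mate leaves the other mate's flags and mate fields pointing at a record that no longer exists, which is part of why limiting intron size at alignment time is the cleaner fix.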

Although not recommended, you could also try the -j option of StringTie (when you assemble each sample) in an attempt to filter out low-coverage introns, assuming those are just rare, bad alignments. That has the side effect of reduced sensitivity (low-expression isoforms might be lost), and it might not help at all if the aligner consistently aligned multiple reads with the same large "introns" in a region (e.g. due to short local repeats, where it preferred the large-intron alignments over the shorter or ungapped ones).
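Whether a junction-coverage cutoff would help depends on how many reads actually support each large "intron". A rough way to check is to count spliced reads per junction from SAM text, as in the sketch below. The function name `junction_support` is an assumption for illustration, strand is ignored, and a raw read count is only a proxy for the junction coverage StringTie computes internally.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def junction_support(sam_lines):
    """Count spliced reads per junction.

    Returns a Counter keyed by (reference name, intron start, intron end),
    with 1-based inclusive intron coordinates.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        for n, op in CIGAR_RE.findall(cigar):
            n = int(n)
            if op == "N":
                counts[(rname, pos, pos + n - 1)] += 1
            # Only these operations advance the position on the reference.
            if op in "MDN=X":
                pos += n
    return counts
```

If a large junction is supported by only one or two reads in a sample, a higher -j value may remove it; if many reads agree on it, the cutoff is unlikely to help, and limiting intron size in the aligner is the more promising route.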

gpertea / stringtie

same gene_id in distant genomic locations #230