Closed karlaarz closed 1 year ago
You'd want to check that these transcripts are both in the FASTA that you used to make a Salmon index (and hence in the Salmon quant files), and in the GTF.
E.g.
grep MSTRG sample1/quant.sf | head
grep MSTRG zebrafish_merged_both.gtf | head
Hi Mike, thanks for your quick response.
Yes, these novel transcripts are in the FASTA, the quant files and in the GTF:
grep MSTRG both_zebra.fa | head
>MSTRG.4.2
>MSTRG.4.4
>MSTRG.2.1
>MSTRG.9.2
>MSTRG.10.2
>MSTRG.6.1
>MSTRG.5.1
>MSTRG.7.1
>MSTRG.7.2
>MSTRG.19.3
grep MSTRG sample1/quant.sf | head
MSTRG.4.2 2476 2476.981 0.000000 0.000
MSTRG.4.4 2610 2645.029 1.863677 131.235
MSTRG.2.1 2322 2290.501 0.610686 37.239
MSTRG.9.2 2504 2868.330 1.006466 76.856
MSTRG.10.2 2296 2782.355 0.105928 7.846
MSTRG.6.1 1289 773.007 4.567673 94.000
MSTRG.5.1 865 715.359 2.362866 45.000
MSTRG.7.1 887 910.781 0.750790 18.205
MSTRG.7.2 1066 1124.650 0.828140 24.795
MSTRG.19.3 1816 1849.839 0.000000 0.000
awk -F "\t" '$3 == "transcript" { print $0 }' zebrafish_merged_both.gtf | awk -F ";" '{print $1";"$2";"$3}' | grep MSTRG | head
1 StringTie transcript 27690 34330 1000 + . gene_id "MSTRG.1"; transcript_id "ENSDART00000162928"; gene_name "eed"
1 StringTie transcript 27814 31905 1000 + . gene_id "MSTRG.1"; transcript_id "ENSDART00000166052"; gene_name "eed"
1 StringTie transcript 27849 31321 1000 + . gene_id "MSTRG.1"; transcript_id "ENSDART00000161955"; gene_name "eed"
1 StringTie transcript 31787 32353 1000 + . gene_id "MSTRG.1"; transcript_id "ENSDART00000163891"; gene_name "eed"
1 StringTie transcript 18716 23837 1000 + . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1 StringTie transcript 18716 19963 1000 + . gene_id "MSTRG.2"; transcript_id "ENSDART00000168451"; gene_name "nfkbiz"
1 StringTie transcript 18716 23837 1000 + . gene_id "MSTRG.2"; transcript_id "ENSDART00000172454"; gene_name "nfkbiz"
1 StringTie transcript 18747 20331 1000 + . gene_id "MSTRG.2"; transcript_id "ENSDART00000172487"; gene_name "nfkbiz"
1 StringTie transcript 18786 23837 1000 + . gene_id "MSTRG.2"; transcript_id "ENSDART00000161190"; gene_name "nfkbiz"
1 StringTie transcript 6408 12027 1000 - . gene_id "MSTRG.3"; transcript_id "ENSDART00000164359"; gene_name "rpl24"
I'm not sure if I might be missing something while importing the data.
ignoreTxVersion = TRUE
strips off whatever trails the transcript id string after .
For example if you have ENST1234.10
this becomes ENST1234
. So that doesn't work for your case as it is the only thing identifying the different MSTRG transcripts.
Do you need to set this to TRUE
?
Oh I also see a bigger problem. These transcripts are named differently in the FASTA/quant.sf and in the GTF.
MSTRG.x.y
in the FASTA but look what follows transcript_id
in the GTF. tximeta
needs to have matching transcript IDs in the FASTA and GTF, it won't do any guesswork.
I am not getting your point; as for instance, in the examples that I provided, I see that the name of the transcript ID MSTRG.2.1
is the same as in FASTA/quant.sf and GTF files.
And you're right, the ignoreTxVersion = TRUE
is not necessary in this case.
GTF doesn't have transcript ids of the form MSTRG.2.1
what about the following line of the GTF:
1 StringTie transcript 18716 23837 1000 + . gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
Oh sorry you're right. I think I see what's happening. The code recognizes versions as numbers following the first .
. It's fairly common to use .
to indicate transcript versions, so .
is not often in the other part of the stable identifier, as it produces an ambiguity as to what is the stable ID and what is the version. The fact that StringTie uses .
for both separators (gene + txp) and (txp + version) is a bit confusing for downstream software.
One solution would be to fix this in the FASTA, just find replace MSTRG.
with MSTRG-
everywhere before running Salmon index. Does it take long to re-quantify the files?
This would not just help with tximeta but other tools that would recognize .
as separating a version number from a stable ID.
Ohh awesome! Ok, I will make those changes and re-run Salmon and then tximeta and will let you know how it goes (: thanks!!
Hi Mike,
Substituting MSTRG.
with MSTRG_
was the perfect solution. I re-ran Salmon, imported the data using tximeta and then ran swish, and everything worked perfectly.
Thanks a lot!
Great, I will try to add this to the documentation. Thanks for the report.
added this to the docs in b09fd02
Hello,
I am trying to run Swish on a novel transcript assembly. I tried to import some Salmon data using the gtf from novel transcripts that I obtained from StringTie. These novel transcripts have "MSTRG" as prefix. I import the data as following:
However, when I search for these novel transcripts, they seem to be missing.
I am using tximeta v1.14.1 and the R version 4.2.0 (2022-04-22).
Any help would be appreciated.
Thanks