Tximeta for novel transcripts by stringite

karlaarz commented 1 year ago

Hello,

I am trying to run Swish on a novel transcript assembly. I tried to import some Salmon data using the gtf from novel transcripts that I obtained from StringTie. These novel transcripts have "MSTRG" as prefix. I import the data as following:

#Genome setting----
indexDir <- file.path(dir, "salmon_idx_both_zebra")
fasta <- file.path(dir, "both_zebra.fa")
gtf <- file.path(dir, "zebrafish_merged_both.gtf")
makeLinkedTxome(indexDir=indexDir, 
                source="de-novo", 
                organism="Danio rerio",
                release="105", genome="GRCz11", fasta=fasta, gtf=gtf, write=FALSE)

#tximeta----
se <- tximeta(coldata, type = "salmon", useHub=FALSE, skipMeta=TRUE, ignoreTxVersion = TRUE)

>head(rownames(se))
[1] "ENSDART00000189431.1" "ENSDART00000189226.1" "ENSDART00000172037.2" "ENSDART00000165410.2"
[5] "ENSDART00000163675.2" "ENSDART00000172374.2"

However, when I search for these novel transcripts, they seem to be missing.

>grep("MSTRG", rownames(se))
integer(0)

I am using tximeta v1.14.1 and the R version 4.2.0 (2022-04-22).

Any help would be appreciated.

Thanks

mikelove commented 1 year ago

You'd want to check that these transcripts are both in the FASTA that you used to make a Salmon index (and hence in the Salmon quant files), and in the GTF.

E.g.

grep MSTRG sample1/quant.sf | head
grep MSTRG zebrafish_merged_both.gtf | head

karlaarz commented 1 year ago

Hi Mike, thanks for your quick response.

Yes, these novel transcripts are in the FASTA, the quant files and in the GTF:

grep MSTRG both_zebra.fa | head
>MSTRG.4.2
>MSTRG.4.4
>MSTRG.2.1
>MSTRG.9.2
>MSTRG.10.2
>MSTRG.6.1
>MSTRG.5.1
>MSTRG.7.1
>MSTRG.7.2
>MSTRG.19.3

grep MSTRG sample1/quant.sf | head
MSTRG.4.2       2476    2476.981        0.000000        0.000
MSTRG.4.4       2610    2645.029        1.863677        131.235
MSTRG.2.1       2322    2290.501        0.610686        37.239
MSTRG.9.2       2504    2868.330        1.006466        76.856
MSTRG.10.2      2296    2782.355        0.105928        7.846
MSTRG.6.1       1289    773.007 4.567673        94.000
MSTRG.5.1       865     715.359 2.362866        45.000
MSTRG.7.1       887     910.781 0.750790        18.205
MSTRG.7.2       1066    1124.650        0.828140        24.795
MSTRG.19.3      1816    1849.839        0.000000        0.000

awk -F "\t" '$3 == "transcript" { print $0 }' zebrafish_merged_both.gtf | awk -F ";" '{print $1";"$2";"$3}' | grep MSTRG | head
1       StringTie       transcript      27690   34330   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000162928"; gene_name "eed"
1       StringTie       transcript      27814   31905   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000166052"; gene_name "eed"
1       StringTie       transcript      27849   31321   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000161955"; gene_name "eed"
1       StringTie       transcript      31787   32353   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000163891"; gene_name "eed"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1       StringTie       transcript      18716   19963   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000168451"; gene_name "nfkbiz"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172454"; gene_name "nfkbiz"
1       StringTie       transcript      18747   20331   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172487"; gene_name "nfkbiz"
1       StringTie       transcript      18786   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000161190"; gene_name "nfkbiz"
1       StringTie       transcript      6408    12027   1000    -       .       gene_id "MSTRG.3"; transcript_id "ENSDART00000164359"; gene_name "rpl24"

I'm not sure if I might be missing something while importing the data.

mikelove commented 1 year ago

ignoreTxVersion = TRUE strips off whatever trails the transcript id string after .

For example if you have ENST1234.10 this becomes ENST1234. So that doesn't work for your case as it is the only thing identifying the different MSTRG transcripts.

Do you need to set this to TRUE?

mikelove commented 1 year ago

Oh I also see a bigger problem. These transcripts are named differently in the FASTA/quant.sf and in the GTF.

MSTRG.x.y in the FASTA but look what follows transcript_id in the GTF. tximeta needs to have matching transcript IDs in the FASTA and GTF, it won't do any guesswork.

karlaarz commented 1 year ago

I am not getting your point; as for instance, in the examples that I provided, I see that the name of the transcript ID MSTRG.2.1 is the same as in FASTA/quant.sf and GTF files.

And you're right, the ignoreTxVersion = TRUE is not necessary in this case.

mikelove commented 1 year ago

GTF doesn't have transcript ids of the form MSTRG.2.1

karlaarz commented 1 year ago

what about the following line of the GTF:

1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";

mikelove commented 1 year ago

Oh sorry you're right. I think I see what's happening. The code recognizes versions as numbers following the first .. It's fairly common to use . to indicate transcript versions, so . is not often in the other part of the stable identifier, as it produces an ambiguity as to what is the stable ID and what is the version. The fact that StringTie uses . for both separators (gene + txp) and (txp + version) is a bit confusing for downstream software.

One solution would be to fix this in the FASTA, just find replace MSTRG. with MSTRG- everywhere before running Salmon index. Does it take long to re-quantify the files?

This would not just help with tximeta but other tools that would recognize . as separating a version number from a stable ID.

karlaarz commented 1 year ago

Ohh awesome! Ok, I will make those changes and re-run Salmon and then tximeta and will let you know how it goes (: thanks!!

karlaarz commented 1 year ago

Hi Mike,

Substituting MSTRG. with MSTRG_ was the perfect solution. I re-ran Salmon, imported the data using tximeta and then ran swish, and everything worked perfectly.

Thanks a lot!

mikelove commented 1 year ago

Great, I will try to add this to the documentation. Thanks for the report.

mikelove commented 1 year ago

added this to the docs in b09fd02

thelovelab / tximeta

Tximeta for novel transcripts by stringite #68