pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
654 stars 172 forks source link

Understanding output file "pseudoalignments.ec" #312

Closed mapostolides closed 3 years ago

mapostolides commented 3 years ago

Hi @pmelsted ,

I also commented this question on a closed issue but not sure if those are monitored: https://github.com/pachterlab/kallisto/issues/207

I have a question about the meaning of the contents of the file "pseudoalignments.ec".

Based on your above description, it would seem that the number of lines in this file should be equal to the number of transcripts in the reference transcriptome fasta. However, when I use this tool, this is not the case.

The last line of the "pseudoalignments.ec" file is as follows:

1049832 353,354,359,364,368,372,381,383,384,385,386,389

This suggests to me that there should be 1049832 targets/transcripts

However, the "number of targets" in my reference transcriptome (which is the number of unique transcripts) is 237,012

This does not make sense to me, and it seems like the "pseudoalignments.ec" cannot be used to identify the transcripts based on these cryptic numbers.

Am I missing something?

For reference, I am using kallisto 0.46.1, and here is the command and output of the command

$ kallisto pseudo -i /project/6007998/maposto/reference/kallisto/gencode.v38.transcripts.fa.idx -o ./ /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R1.fastq.gz /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R2.fastq.gz

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 237,012
[index] number of k-mers: 144,492,748
[index] number of equivalence classes: 1,015,345
[quant] running in paired-end mode
[quant] will process pair 1: /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R1.fastq.gz
                             /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 1,000,000 reads, 997,477 reads pseudoaligned

Thanks, Michael

Yenaled commented 3 years ago

1049832 is an equivalence class; not a transcript. In your example, a read that has equivalence class 1049832 is compatible with transcripts #353, #354, #359, #364, #368, #372, #381, #383, #384, #385, #386, and #389.

The reason we have equivalence classes is because a read may be compatible with more than one transcript (if each read were compatible with one and only one unique transcript, then you'd see 237,012 lines in the file). So because reads may be compatible with multiple transcripts, how do we figure out transcript counts? Well, at the end of your kallisto run, an EM algorithm is used to probabilistically estimate transcript counts from equivalence classes.

Hope that makes sense!

mapostolides commented 3 years ago

Thanks @Yenaled !

That does make sense. Another question, let's say I want to know which reads correspond to the equivalence class 1049832. Currently, the file pseudoalignments.tsv provides the following information:

1049832 1

I take this to mean that 1 read maps to this equivalence class. Is there any way to find out which read this is?

Also, I notice that the number of equivalence classes provided in the file pseudoalignments.ec is 1049832, whereas the number of equivalence classes stated in kallisto's output is 1,015,345. Do you have any idea why this discrepancy might be?

Thanks, Michael

Yenaled commented 3 years ago

There is no way in "kallisto pseudo" to figure out which read an equivalence class corresponds to.

However, using the newer "kallisto bus" command with the --num option, the read number is outputted in the BUS file.

As for the discrepancy, the number of equivalence classes in the index is 1,015,345. When the index is being constructed, kallisto will look at the reference transcriptome and make some equivalence classes containing multiple similar transcripts. However, as your sequencing data is being fed into kallisto, more multi-compatibility instances will be discovered and thus more equivalence classes will be created.

mapostolides commented 3 years ago

Thanks @Yenaled , that is really helpful

One more question: I'm not actually working with single cell data, does it matter that "kallisto bus" is meant for scRNA-seq data? I really only want the pseudoalignments/ECs and read IDs and that's it

Thanks!

Yenaled commented 3 years ago

You can use "kallisto bus" for bulk RNAseq data.

If you install the beta version of kallisto (in the devel branch), you can entirely replace "kallisto pseudo" with "kallisto bus" and it should work!

mapostolides commented 3 years ago

Thanks @Yenaled !

I was wondering about the meaning of the output BUS file. I converted it to text format using the following command:

bustools text -o output..txt output.bus

CGCATGCATTAGTACCTC      GCCTAAAT        502574  1
GTATGAAAGACTTTCAAG      CTTGCAAT        426838  1
GAAGCTGGCGCCGCTTTA      TCATCCTG        353320  1
GTATTGAAGGCCGTTAAC      GTCACTGC        286122  1

Here I see the first 2 columns with sequences (which are unclear what they are), and then the thiird column I am assuming is the read number? The 4th column is not clear to me either, and is all 1's

What I am looking for is which reads (i.e. their read IDs, e.g. @R000000001 ) map to which equivalence classes, so not sure how that works.

Thanks!

sbooeshaghi commented 3 years ago

@mapostolides This video may be helpful in answering your questions: https://www.youtube.com/watch?v=LUX3ugvSPp8

mapostolides commented 3 years ago

Thhanks @sbooeshaghi that is helpful. I'm wondering whether there is any way to identify specific reads? I am using bulk RNA-seq data and the "sample" and "UMI" sequences in the first two columns seem to be specific to sc-RNA-seq and don't allow me to identify reads (e.g. @R000000001 in my FASTQ file)

Yenaled commented 3 years ago

Again, run kallisto bus with the --num option.

Then, run bustools text with the -f option to get the flag (aka num) column.

mapostolides commented 3 years ago

Thanks! The -f option gives me what I need

mapostolides commented 3 years ago

Hi @Yenaled @pmelsted ,

There seems to be some inconsistency in the results of these commands. I do the following:

1) kallisto bus --num -x SureCell -o ./ -i /project/6007998/maposto/reference/kallisto/gencode.v38.transcripts.fa.idx  /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R1.fastq.gz /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R2.fastq.gz

2) bustools text -f -o output.bus.txt output.bus

I am going to choose EC 1016857 for this example, and look at the different files to show the inconsistency. I am using a simulated FASTQ file which has the transcripts of origin labelled in the read name. It has 1 million reads.

1) output.bus.txt

TCAACCTGCCTTTTGCTG  AAAATCTC    1016857 1   173552
GATGCTCAGTCTCATACT  ACTGGTCA    1016857 1   576035
ACTTGTGGTTGGGACACG  TAGGAGAC    1016857 1   707150
AGGAAGTGGCCTTGGATC  TGTTCCAG    1016857 1   994163

-I am really only interested in the 3rd and 5th columns, which I understand to be the EC and read IDs, respectively. -I understand this to mean that 4 reads map to the EC 1016857.

2) pseudoalignments.tsv I understand this file to contain 2 columns: EC and the # of reads which map to it. However, according to this file, the EC 1016857 has the following count info:

1016857 1

This is inconsistent with 1)

3) Fastq read transcript origins Additionally, I extract read names from my fastq file, which I generated using a simulator. The 4 reads from 1) have the following origins:

"R000173552:ENST00000613083.1|ENSG00000278543.1|OTTHUMG00000188030.1|OTTHUMT00000475814.1|CTD-3018O17.6-001|CTD-3018O17.6|1904|unprocessed_pseudogene|:409:542"
"R000576035:ENST00000596440.5|ENSG00000167766.19|OTTHUMG00000182761.4|OTTHUMT00000463709.1|ZNF83-208|ZNF83|561|processed_transcript|:98:38"
"R000707150:ENST00000545872.1|ENSG00000167766.19|OTTHUMG00000182761.4|OTTHUMT00000463502.1|ZNF83-204|ZNF83|2633|protein_coding|:556:470"
"R000994163:ENST00000598479.1|ENSG00000167555.14|OTTHUMG00000156494.9|OTTHUMT00000462919.1|ZNF528-212|ZNF528|4809|retained_intron|:821:757"

4) pseudoalignments.ec However, when I look at the EC and the transcripts that map to it, I see the following:

1016857 181288,181289,181290,181291,181292,181293,181294,181295,181297

-I understand column 2 to be the transcripts which map to the equivalence class 1016857

5) transcripts.txt Therefore, if I take 181288-181297 and find those line numbers in transcripts.txt, I should find transcripts which are the same as the transcripts from which my simulated reads originated.

$ cat transcripts.txt | head -181297 | tail -10
ENST00000565374.2|ENSG00000261285.7|OTTHUMG00000176580.3|OTTHUMT00000432579.2|RP11-2L4.1-001|RP11-2L4.1|967|lncRNA|
ENST00000268613.14|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432918.1|CDH13-201|CDH13|2722|protein_coding|
ENST00000569144.5|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432919.1|CDH13-212|CDH13|583|nonsense_mediated_decay|
ENST00000562601.5|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432920.1|CDH13-206|CDH13|545|nonsense_mediated_decay|
ENST00000568770.5|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432921.1|CDH13-211|CDH13|589|nonsense_mediated_decay|
ENST00000567109.6|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432917.3|CDH13-209|CDH13|7876|protein_coding|
ENST00000565636.5|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432923.1|CDH13-207|CDH13|810|protein_coding|
ENST00000431540.7|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432924.1|CDH13-203|CDH13|831|protein_coding|
ENST00000428848.7|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432925.1|CDH13-202|CDH13|2195|protein_coding|
ENST00000539548.6|ENSG00000140945.17|OTTHUMG00000176635.3|OTTHUMT00000432926.1|CDH13-204|CDH13|2199|nonsense_mediated_decay|

However, these ENST/ENSG IDs are completely different, and are on chromosome 16, whereas the reads which map to the transcripts of origin are from chromosome 19. The genes also seem to be unrelated. It seems that numbers/identifiers are getting mixed up somehow in the kallisto output files. My feeling is that among these different files, the EC numbers are different somehow, but it isn't clear how or whether this is the case

I have used the same reference transcriptome both to generate the simulated reads, and also to use as reference for running kallisto bus.

Any insights into this and what I might be missing would be much appreciated!

Thanks and best wishes, Michael

Yenaled commented 3 years ago

Is your data SureCell data? If not, you should not be using "-x SureCell" when running kallisto bus...

mapostolides commented 3 years ago

I need to choose something for "-x" right? It is a required argument. Which option should I choose?

Yenaled commented 3 years ago

In the devel branch (aka the beta version of kallisto), -x is not a required argument. In one of my comments above, I mentioned that you can entirely replace "kallisto pseudo" with "kallisto bus" in the devel branch. In your case (with paired-end reads), you'd do:

kallisto bus --num --paired -o ./ -i /project/6007998/maposto/reference/kallisto/gencode.v38.transcripts.fa.idx /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R1.fastq.gz /project/6007998/maposto/MODULES/simReads/sim_sample.names_truth.tpm1.1mil_lib_R2.fastq.gz

mapostolides commented 3 years ago

Thanks @Yenaled , I installed the devel branch of kallisto and was able to run the above command!

I still have some questions about discrepancies in the equivalence classes. Consider the following 4 reads from the bus file. I have filled in the R + 0's so that it corresponds to the read_id from the fastq file.

               seq1 seq2      ec num    read_id
2  AAAAAAAAAAAAAAAA TRUE  290758   1 R000020151
8  AAAAAAAAAAAAAAAA TRUE  290758   1 R000185187
17 AAAAAAAAAAAAAAAA TRUE  290758   1 R000393928
31 AAAAAAAAAAAAAAAA TRUE  290758   1 R000811410

Columns 3 and 5 are the equivalence class, and read IDs, respectively. I can confirm which transcripts each read originates from since I've included the transcript of origin in the read names when generating the simulated fastq. For the above 4 reads, they all come from the gene DPM2.

Now, I want to take equivalence class 290758 and find which line in the file matrix.ec corresponds to this equivalence class. When I look at this file, I see the following for EC 290758:

290758  178601,178608,178613,178620,178624,178634,178635,178636,178644,178651,178652

If I understand correctly, this means that if I look at the file transcripts.txt, then I should be able to map these transcript IDs "178601"... etc, back to line numbers in that file. However, all of transcripts on lines 178601-178652 of transcripts.txt correspond to gene NDRG4, which is not present in the names of any of the reads mapped to equivalence class 290758.

So, my question is: How do I map EC values between output.bus and matrix.ec?

Yenaled commented 3 years ago

A couple of thoughts:

The read numbers are stored in the BUS file starting at 0. So the first read in the FASTQ file is given the number 0, the second read in the FASTQ file is given the number 1, etc. So be very, very careful when looking back into the FASTQ file when you take the read number from the BUS file (e.g. what you think is read R000020151 is probably actually read R000020152).

Secondly, the transcripts in transcripts.txt are also 0-indexed: the first line on that file is actually transcript #0, the second line in that file is transcript # 1, etc.

If this doesn't resolve the issue, can you upload all the files and send it to me via email? I'd be happy to take a closer look (click on my GitHub username and you should see my email in my profile).

mapostolides commented 3 years ago

Hi @Yenaled ,

The FASTQ file is also 0-indexed (i.e. starting with R000000000), so the indexing is the same

For transcript indexing, the 0/1 indexing doesn't explain why the transcripts don't match in all cases. I also accounted for 0/1 by looking at both.

I'll send the files to you, thanks!

Yenaled commented 3 years ago

Closing this issue; it was resolved via email (the simulated fastq files were mixed up). Both kallisto and bustools are working as expected when extracting reads from fastq files and when mapping equivalence classes to transcripts.