nedialkova-lab / mim-tRNAseq

Modification-induced misincorporation tRNA sequencing
GNU General Public License v3.0

Interpretation of output files #65

Open pawanchk opened 12 months ago

pawanchk commented 12 months ago

Hi,

I have some questions regarding the output files generated after the mim-tRNAseq analysis -

  1. In the raw counts file counts/Anticodon_counts_raw.txt, what does the last column, size, refer to? In the manual (https://mim-trnaseq.readthedocs.io/en/latest/output.html) size is explained for the Isodecoder output file, but not for the Anticodon output file.

  2. If I use the reverse complement of the fastq files as input, the output counts do not change. Does the program already consider the reverse complement?

  3. Among the output files, where can I find the length information for the tRNA sequences that are found in the input data?

I look forward to hearing from you.

nedialkova-lab commented 12 months ago

Hi,

  1. The definition of the size column in the Anticodon output file is the same as in Isodecoder, i.e. the number of sequences in the reference file that have a specific anticodon.

  2. The alignment settings for GSNAP are in the align.log file - you will see there that the alignment mode is set to default, i.e. both forward and reverse strand are considered.

  3. I'm not sure I understand what you mean by length information of the tRNA sequences in the input data - if this refers to the length of the mapped reads, this info can be obtained from the bam files in /align. Another useful file is RTstopTable.csv in /mods: this includes tRNA/cluster, canonical tRNA position and proportion of reads that stop at each position (normalized to total coverage of the reference sequence). This gives the relative frequency of reads stopping at all positions for a reference, the sum of which should equal 1.
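
The sum-to-1 property described above can be checked with a short script. This is only a minimal sketch, not part of mim-tRNAseq: the column names (`ref_col`, `prop_col`, `group_cols`) and the delimiter are assumptions, so check them against the header of your own RTstopTable.csv before using it.

```python
import csv
from collections import defaultdict


def stop_proportion_sums(handle, ref_col, prop_col, group_cols=(), delimiter=","):
    """Sum the stop proportions per reference (plus optional grouping columns).

    ref_col, prop_col and group_cols are hypothetical names supplied by the
    caller; they must match the header of the table that is actually read.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter=delimiter):
        # One key per reference, optionally split further (e.g. per sample).
        key = tuple(row[col] for col in (ref_col, *group_cols))
        totals[key] += float(row[prop_col])
    return dict(totals)


def sums_not_equal_to_one(totals, tolerance=1e-6):
    """Return only the groups whose proportions do not add up to 1."""
    return {key: total for key, total in totals.items()
            if abs(total - 1.0) > tolerance}
```

To use it, open the table from the mods directory and pass the file handle together with the column names found in its header. If the table holds several samples, add the sample column to `group_cols` so that each reference is summed per sample rather than across all of them.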

pawanchk commented 12 months ago

Hi,

Thank you for your detailed response - this is very helpful.

For additional clarification, here are more details on my questions 2 and 3:

  1. Regarding the 2nd question: the input is a single fastq file (it is not paired-end sequencing data). I first analysed it as is, and then ran a second analysis on the reverse complement of the same fastq file. In this case, does the workflow consider both the forward and the reverse complement orientation in each run?

Regarding the alignment mode - are you referring to this from GSNAP ?

 --mode=STRING                  Alignment mode: standard (default), cmet-stranded, cmet-nonstranded,
                                    atoi-stranded, atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded.
                                    Non-standard modes requires you to have previously run the cmetindex
                                    or atoiindex programs (which also cover the ttoc modes) on the genome

  2. Regarding the 3rd question: I am referring to the length of each tRNA that is mapped in the input sequencing data. For example, for the tRNA Homo_sapiens_tRNA-Ala-CGC we get counts in the Anticodon_counts_raw.txt file. How much of this tRNA sequence is covered by the input data: is it mapped full length, or only in part? Where can I find this information among the output files?
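
One way to look at this from the bam files is to compute, for every aligned read, the stretch of the reference it covers (from the SAM POS and CIGAR fields) and then count how many reads span the whole reference. The following is only a rough sketch under stated assumptions: it reads SAM text, for example the output of `samtools view -h` on one of the bam files, and the function names, `ref_length` and `slack` are hypothetical helpers that are not part of mim-tRNAseq.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference."""
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    return pos, pos + ref_len - 1


def spans_by_reference(sam_lines):
    """Group the reference spans of all mapped reads by reference name."""
    spans = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read
            continue
        spans[rname].append(reference_span(pos, cigar))
    return dict(spans)


def fraction_full_length(spans, ref_length, slack=0):
    """Fraction of reads covering the reference end to end, within `slack` bases."""
    if not spans:
        return 0.0
    full = sum(1 for start, end in spans
               if start <= 1 + slack and end >= ref_length - slack)
    return full / len(spans)
```

The reference lengths can be taken from the `@SQ` header lines (the `LN` tag) of the same file. Note that the reference names in the bam file may be cluster names rather than individual tRNA names, so the result should be read per reference as it appears in the alignment.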