Closed ruixuan-zhang closed 1 year ago
Hi Ruixuan,
Thanks for the error report. As you know, running mim-tRNAseq on a de novo predicted tRNA set is something we've not tested extensively.
The example sequence you sent seems to be ok. I don't think the error is coming from your command but rather something going wrong within mim-tRNAseq code. I have a feeling this might be related to your custom reference somehow.
Unfortunately I am on holiday from today for a little more than 3 weeks. When I'm back I would be happy to help. Perhaps you could share with me your reference (fasta and out file) and some of the sequencing data (a small subset is fine as long as it recreates the error) so that I can test this when I'm back? Let me know if that's an issue.
Hi Behrens,
Thank you for your fast reply. I would love to provide any information if it is needed. Have a nice vacation!
Best, Ruixuan
Hi Ruixuan,
I would like to do some testing to fix this issue. Could you please send me the three input files you are specifying as your reference tRNA sequences (acaCas-tRNA-merged.fa, acaCas-tRNA-merged.out, and acaCas-mitotRNAs.fa), as well as your sample_table file, and either a) one of the input fastq files that also generates an error for you, or b) a smaller subset of all of the fastq files you are using.
Thanks, Drew
Hi @ruixuan-zhang,
Thanks for sending me your reference tRNA sequence files. I am posting this here so that future users using custom references might also benefit.
I found a small bug in the code, which I have fixed. Unfortunately, I then ran into further problems that have to do with your custom reference and the way the tRNA genes are named. A short reminder: the first of the two numbers in a tRNA gene name indicates which transcript (or isodecoder) the gene belongs to, and the second is the gene copy number within that isodecoder. For example, Ala-AGC-1-1 and Ala-AGC-1-2 have the same (mature) transcript sequence and are gene copies 1 and 2, respectively, whereas Ala-AGC-2-1 has a distinct mature sequence. This naming is very important for mimseq, and in general for maintaining correct annotation of tRNA gene sets. In your custom reference, I found multiple examples of genes that a) have different isodecoder numbers but cluster together (i.e. by name they should differ in sequence, but their transcript sequences are in fact identical), or b) have the same isodecoder number but cluster separately (the opposite of a)).
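For anyone who wants to screen a custom reference for these two cases before running mimseq, here is a minimal Python sketch. It is not part of mimseq: the function names and the input (a dict mapping each gene name to its mature transcript sequence) are assumptions made for illustration, and it only compares exact sequence identity. It also assumes names end in isodecoder-copy, as in Gly-GCC-1-10.

```python
from collections import defaultdict


def parse_name(name):
    """Split a name such as 'Gly-GCC-1-10' into ('Gly-GCC', 1, 10)."""
    family, isodecoder, copy = name.rsplit("-", 2)
    return family, int(isodecoder), int(copy)


def find_naming_conflicts(mature_seqs):
    """mature_seqs: hypothetical dict of gene name -> mature transcript sequence."""
    by_seq = defaultdict(set)  # (family, sequence) -> isodecoder numbers seen
    by_iso = defaultdict(set)  # (family, isodecoder) -> distinct sequences seen
    for name, seq in mature_seqs.items():
        family, iso, _ = parse_name(name)
        seq = seq.upper()
        by_seq[(family, seq)].add(iso)
        by_iso[(family, iso)].add(seq)
    # Case a): one sequence carried by several isodecoder numbers.
    merged = {key: sorted(isos) for key, isos in by_seq.items() if len(isos) > 1}
    # Case b): one isodecoder number covering several distinct sequences.
    split = {key: len(seqs) for key, seqs in by_iso.items() if len(seqs) > 1}
    return merged, split
```

An empty result for both dictionaries would suggest the names agree with the sequences, at least at the level of exact identity. Running mimseq with --cluster-id 1 remains the more reliable check, since it works on the processed transcripts mimseq itself builds.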
To get a full picture of the problem, you can run mimseq as you did before but instead specify --cluster-id 1 to cluster identical mature tRNAs only. Then you can look at the *clusterInfo.txt file to see which sequences are clustering with which. This can guide how you rename your reference sequences. As an example, running mimseq as I describe on your supplied sequences, I get the following at the beginning of the *clusterInfo.txt file:
From this you can see how a) Gly-GCC-2, Gly-GCC-8, Gly-GCC-9 and others are all clustered with Gly-GCC-1-10 (they should not be), and b) Gly-GCC-1-1 is in a separate cluster (by itself) from Gly-GCC-1-10, Gly-GCC-1-25, etc. (it also should not be).
As an example of how the human reference set looks when clustered using --cluster-id 1, see below:
Notice how all Lys-TTT tRNAs with the same transcript sequences (i.e., the first number is the same) all cluster together (e.g., Lys-TTT-3-1 to Lys-TTT-3-5), while those with unique sequences all belong to distinct clusters.
Please update your reference sequence names according to this structure and try again. In the meantime, I will release a new mimseq version with the bugfix I mentioned (and some other updates). Please update your mimseq version in the next week or so (ensure it is v1.3.3) to get this fix and to continue testing.
Hi Drew, Thank you so much for your kind help! I misunderstood this naming format as "ChrNum-GeneID", such as "1-1" representing the first tRNA gene on Chr1.
In my case, since Gly-GCC-1-10, Gly-GCC-1-25, Gly-GCC-1-26, Gly-GCC-2-2, ... were clustered together, they should be named Gly-GCC-1-1, Gly-GCC-1-2, Gly-GCC-1-3, Gly-GCC-1-4, ...
Additionally, the previous Gly-GCC-1-1, which was in another cluster, should be renamed to something like Gly-GCC-2-1 so that it is separated from the previous cluster, right?
I appreciate your guidance and advice on this matter. Best regards, Ruixuan
Hi Ruixuan,
Yes that's exactly right! I think the easiest way is to use mimseq with a cluster-id of 1 as I describe above (because this also uses mature, processed tRNA transcripts for clustering and not genomic sequence). See here for a nicer detailed explanation about tRNA gene symbols from the GtRNAdb folks.
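For future readers, the renaming confirmed above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not mimseq code: rename_by_sequence and its input (a dict mapping each current gene name to its mature transcript sequence) are hypothetical. Numbering the largest group as isodecoder 1 is an arbitrary choice for this sketch and is not how GtRNAdb assigns its numbers.

```python
from collections import defaultdict


def rename_by_sequence(mature_seqs):
    """Return {old_name: new_name} so identical mature sequences share an isodecoder number.

    mature_seqs: hypothetical dict of gene name -> mature transcript sequence.
    """
    families = defaultdict(dict)  # family -> {sequence: [old names, input order]}
    for name, seq in mature_seqs.items():
        family = name.rsplit("-", 2)[0]  # 'Gly-GCC-1-10' -> 'Gly-GCC'
        families[family].setdefault(seq.upper(), []).append(name)

    mapping = {}
    for family, groups in families.items():
        # Largest group first (arbitrary); the sort is stable, so ties keep input order.
        ordered = sorted(groups.values(), key=len, reverse=True)
        for iso, names in enumerate(ordered, start=1):
            for copy, old_name in enumerate(names, start=1):
                mapping[old_name] = f"{family}-{iso}-{copy}"
    return mapping
```

With Gly-GCC-1-10, Gly-GCC-1-25 and Gly-GCC-2-2 sharing one sequence and Gly-GCC-1-1 carrying another, this maps the first three to Gly-GCC-1-1, Gly-GCC-1-2 and Gly-GCC-1-3, and the former Gly-GCC-1-1 to Gly-GCC-2-1. Any companion file that refers to the old names would presumably need the same mapping applied.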
Also, the new mimseq version (v1.3.3) has been released on pip and GitHub and should also be available on bioconda in the next few days.
Thank you very much for your help!!!
Hi, Good day.
Could you help me with the following error message I got when I ran mimseq?
I am trying to do tRNA-seq on an amoeba. No tRNA annotation is available in the databases, so I predicted the tRNAs de novo using tRNAscan and converted them to the corresponding format.
The error occurs in the step "Determining non-deconvoluted clusters due to insufficient coverage at mismatches".
The command I ran is
Do you have any idea of which part may be wrong? Thank you very much for your time. Best, Ruixuan