KeyError: "sequence" during Determining non-deconvoluted clusters due to insufficient coverage at mismatches

ruixuan-zhang commented 1 year ago

Hi, Good day.

Could you help me with the following error message I got when I ran mimseq?

I tried to do tRNA-seq on amoeba. Nothing can be found on database, thus I de novo predicted them using tRNAscan and modified into the corresponding format.

>Acanthamoeba_castellanii_tRNA-Gly-GCC-1-1 (tRNAscan-SE ID: scaffold_1.trna1) scaffold_1:110250-110342 (+) Gly (GCC) 93 bp Sc: 51.8
GGCGTCGTAGCTTAGTTGGAAGAGCGTGCTAATGCCACCAAGAGCGCAAGCCATGGATGCAAGCAGGCCCCGGTTCGATTCCGGGCGACGCCA

In the step of "Determining non-deconvoluted clusters due to insufficient coverage at mismatches".

2023-03-17 09:44:42,680 [INFO ] *** Un-deconvoluted sequences being defined based on covera
ge difference more than 50.00%
2023-03-17 09:44:42,681 [INFO ] Determining unsplittable sequences...
2023-03-17 09:44:42,874 [INFO ] Calculating nucleotide coverage for /lustre/aptmp/ruixuan/r
iboseq/tRNA/mimseq/cl_09/R1_0hpi.unpaired_uniq.bam
2023-03-17 09:44:45,258 [INFO ] 9 unique sequences not split from cluster parent
2023-03-17 09:44:45,258 [INFO ] 16 parents and isododecoders not deconvoluted due to insuff
icient coverage at mismatches
Traceback (most recent call last):
  File "/usr/appli/freeware/miniconda/4.5/envs/mimseq13/bin/mimseq", line 10, in <module>
    sys.exit(main())
  File "/usr/appli/freeware/miniconda/4.5/envs/mimseq13/lib/python3.7/site-packages/mimseq/
mimseq.py", line 420, in main
    args.misinc_thresh, args.mito, args.plastid, args.pretrnas, args.local_mod, args.p_adj,
 args.crosstalks, args.sampledata)
  File "/usr/appli/freeware/miniconda/4.5/envs/mimseq13/lib/python3.7/site-packages/mimseq/
mimseq.py", line 123, in mimseq
    isodecoder_sizes, unsplitCluster_lookup = getDeconvSizes(splitBool_new, tRNA_dict, clus
ter_dict, unique_isodecoderMMs_new)
  File "/usr/appli/freeware/miniconda/4.5/envs/mimseq13/lib/python3.7/site-packages/mimseq/
splitClusters.py", line 430, in getDeconvSizes
    totalIsoSize += len([info['sequence'].upper() for info in tRNA_dict.values() if info['s
equence'].upper() == tRNA_dict[isodecoder]['sequence'].upper()])
  File "/usr/appli/freeware/miniconda/4.5/envs/mimseq13/lib/python3.7/site-packages/mimseq/
splitClusters.py", line 430, in <listcomp>
    totalIsoSize += len([info['sequence'].upper() for info in tRNA_dict.values() if info['s
equence'].upper() == tRNA_dict[isodecoder]['sequence'].upper()])
KeyError: 'sequence'

The command I ran is

mimseq -t ./acaCas-tRNA-merged.fa -o ./acaCas-tRNA-merged.out -m ./acaCas-mitotRNAs.fa --threads 8 --cluster-id 0.9 --min-cov 0.0005 --max-mismatches 0.075 --control-condition 0hpi -n APMV_test_0315 --out-dir ${OUTPUT} --remap --remap-mismatches 0.05 ${sample_table}

Do you have any idea of which part may be wrong? Thank you very much for your time. Best, Ruixuan

drewjbeh commented 1 year ago

Hi Ruixuan,

Thanks for the error report. As you know, running mim-tRNAseq on a de novo predicted tRNA set is something we've not tested extensively.

The example sequence you sent seems to be ok. I don't think the error is coming from your command but rather something going wrong within mim-tRNAseq code. I have a feeling this might be related to your custom reference somehow.

Unfortunately I am on holiday from today for a little more than 3 weeks. When I'm back I would be happy to help. Perhaps you could share with me your reference (fasta and out file) and some of the sequencing data (a small subset is fine as long as it recreates the error) so that I can test this when I'm back? Let me know if that's an issue.

ruixuan-zhang commented 1 year ago

Hi Behrens,

Thank you for your fast reply. I would love to provide any information if it is needed. Have a nice vacation!

Best, Ruixuan

drewjbeh commented 1 year ago

Hi Ruixuan,

I would like to do some testing to fix this issue. Could you please send me the three input files you are specifying as your reference tRNA sequences (acaCas-tRNA-merged.fa, acaCas-tRNA-merged.out, and acaCas-mitotRNAs.fa), as well as your sample_table file, and either a) one of the input fastq files that also generates an error for you, or b) a smaller subset of all of the fastq files you are using.

Thanks, Drew

drewjbeh commented 1 year ago

Hi @ruixuan-zhang,

Thanks for sending me your reference tRNA sequence files. I am posting this here so that future users using custom references might also benefit.

I found a small bug in the code that I fixed. Unfortunately I encountered more problems later on. This has to do with your custom reference and the way the tRNA genes are named. A short reminder that the first of the two numbers in a tRNA gene name indicates which transcript (or isodecoder) it belongs to. The second number is the gene number within that isodecoder. For e.g., Ala-AGC-1-1 and Ala-AGC-1-2 have the same (mature) transcript sequence and are gene copy 1 and 2, respectively. However, Ala-AGC-2-1 is distinct in mature sequence. This naming is very important for mimseq, and in general to maintain correct annotation of tRNA gene sets. In your custom reference, I found multiple examples of genes that have a) different isodecoder numbers being clustered together (i.e. by name they should have different sequence, but indeed their transcript sequences are identical), or b) with the same isodecoder number being clustered separately (the opposite of a) ).

To get a full picture of the problem, you can run mimseq as you did before but instead specify --cluster-id 1 to cluster identical mature tRNAs only. Then you can look at the *clusterInfo.txt file to see which sequences are clustering with which. This can guide how you rename your reference sequences. As an example, running mimseq as I describe on your supplied sequences I get the following in the beginning *clusterInfo.txt file:

Screenshot 2023-04-18 at 10 37 03

From this you can see how a) Gly-GCC-2, Gly-GCC-8, Gly-GCC-9 and others are all clustered with Gly-GCC-1-10 (they should not be), and b) Gly-GCC-1-1 is in a separate cluster (by itself) from Gly-GCC-1-10, Gly-GCC-1-25, etc. (it also should not be).

drewjbeh commented 1 year ago

As an example of how the human reference set looks when clustered using --cluster-id 1 see below:

Screenshot 2023-04-18 at 10 42 10

Notice how all Lys-TTT tRNAs with the same transcript sequences (i.e., the first number is the same) all cluster together (e.g., Lys-TTT-3-1 to Lys-TTT-3-5), while those with unique sequences all belong to distinct clusters.

Please update your reference sequence names according to this structure and try again. In the meantime, I will release a new mimseq version with the bugfix I mentioned (and some other updates). Please update your mimseq version in the next week or so (ensure it is v1.3.3) to get this fix and to continue testing.

ruixuan-zhang commented 1 year ago

Hi Drew, Thank you so much for your kindly help! I misunderstood this naming format as "ChrNum-GeneID" such as "1-1" representing the first tRNA gene on Chr1.

In my case, since Gly-GCC-1-10, Gly-GCC-1-25, Gly-GCC-1-26, Gly-GCC-2-2 ,... were clustered together. They should be named as Gly-GCC-1-1, Gly-GCC-1-2, Gly-GCC-1-3, Gly-GCC-1-4 , ...

Additionally, the preivous Gly-GCC-1-1 which was in another cluster, should be renamed such as Gly-GCC-2-1, so that it can be separated from the previous cluster, right?

I appreciate your guidance and advice on this matter. Best regards, Ruixuan

drewjbeh commented 1 year ago

Hi Ruixuan,

Yes that's exactly right! I think the easiest way is to use mimseq with a cluster-id of 1 as I describe above (because this also uses mature, processed tRNA transcripts for clustering and not genomic sequence). See here for a nicer detailed explanation about tRNA gene symbols from the GtRNAdb folks.

Also, the new mimseq version (v1.3.3) has been released on pip and GitHub and should also be available on bioconda in the next few days.

ruixuan-zhang commented 1 year ago

Thank you very much for your help!!!

nedialkova-lab / mim-tRNAseq

KeyError: "sequence" during Determining non-deconvoluted clusters due to insufficient coverage at mismatches #41