NA for all gene ages - Githubissues

glarue commented 1 year ago

Hi again @josuebarrera!

I'm trying to run GenEra on a longest-isoform proteome of Arabidopsis, and while the DIAMOND (v2.1.6) search step seems to complete successfully, the final output has NA for all genes in gene_ages.tsv. I believe there's a known issue with DIAMOND failing to propagate scientific names correctly during DB construction, but the taxonomy IDs are present and the output file contains the taxonomy IDs as expected.

Furthermore, I've spot-checked the DIAMOND output manually and in those cases, results for the genes assigned NA are present in the bout file, e.g.:

~/data$ grep 'AraTha-rna-NM_001084197.2' 3702_gene_ages.tsv
AraTha-rna-NM_001084197.2   Absent from the DIAMOND/MMseqs2 results NA  NA

~/data/tmp_3702_23548$ zgrep 'AraTha-rna-NM_001084197.2' 3702_Diamond_results.bout.gz
AraTha-rna-NM_001084117.3\tgene-AT1G23147\tNC_003070.9\t+\t8205934:8206206\t91\t8205934-8206206 AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 9.45e-09    47.8    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 2.38e-67    196 3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_001084905.2\tgene-AT4G11653\tNC_003075.7\t-\t7037086:7037358\t91\t7037086-7037358 1.70e-24    87.8    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_001084638.2\tgene-AT3G04735\tNC_003074.8\t-\t1292408:1292725\t106\t1292408-1292725    1.41e-07    45.1    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_001084117.3\tgene-AT1G23147\tNC_003070.9\t+\t8205934:8206206\t91\t8205934-8206206 2.08e-07    44.3    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_001335782.1\tgene-AT2G22055\tNC_003071.7\t+\t9379578:9379817\t80\t9379578-9379817 6.58e-07    42.7    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 AraTha-rna-NM_104838.2\tgene-AT1G61566\tNC_003070.9\t-\t22717265:22717492\t76\t22717265-22717492    6.75e-06    40.0    3702
AraTha-rna-NM_104838.2\tgene-AT1G61566\tNC_003070.9\t-\t22717265:22717492\t76\t22717265-22717492    AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 5.70e-06    40.0    3702
AraTha-rna-NM_001335782.1\tgene-AT2G22055\tNC_003071.7\t+\t9379578:9379817\t80\t9379578-9379817 AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 5.85e-07    42.7    3702
AraTha-rna-NM_001084638.2\tgene-AT3G04735\tNC_003074.8\t-\t1292408:1292725\t106\t1292408-1292725    AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 6.50e-07    43.5    3702
AraTha-rna-NM_001084905.2\tgene-AT4G11653\tNC_003075.7\t-\t7037086:7037358\t91\t7037086-7037358 AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 1.40e-23    85.5    3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 NP_001077666.1  6.20e-63    195 3702;1240361
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 CAA0267533.1    3.59e-62    193 3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 VYS48050.1  5.98e-61    190 3702
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_002874722.1  1.31e-38    134 59689;81972
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG7552807.1    3.59e-36    128 1240361
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG7557481.1    2.01e-35    126 45249
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_019084701.1  1.62e-21    91.3    90675
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 NP_001078374.1  3.12e-20    87.8    3702;1240361
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG7620113.1    8.93e-20    86.7    45249
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG7579121.1    3.51e-17    80.1    45249;1240361
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_024010228.1  1.77e-16    78.2    72664
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_002883850.1  4.06e-16    77.4    59689;81972
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_019088361.1  2.39e-15    75.5    90675
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 EOA22711.1  1.18e-14    74.3    81985
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 ESQ38323.1  1.38e-13    70.5    72664
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_002874709.1  8.73e-12    66.2    81972
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 XP_020869967.1  3.56e-09    60.1    59689;81972
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG7598041.1    2.79e-08    57.8    45249
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 RID64494.1  1.75e-06    52.8    3708;3711
AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433 KAG2273089.1    9.48e-06    50.8    52824

Any idea what might be going on? Thanks!

josuebarrera commented 1 year ago

Hello, @glarue!

It seems that the DIAMOND step ran successfully. I think the problem here is related to the complex names that were given to the Arabidopsis proteins (e.g., AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433). Specifically, I think the \t characters are being interpreted as tabs by GenEra and that is messing up the parsing process of the DIAMOND table. The DIAMOND table is separated by tabs, so when GenEra tries to identify the value in column 5 (taxid), it is instead finding part of the gene name (e.g., 13049164:13049433). Since the pipeline cannot associate this string with any taxid in the NCBI taxonomy, it registers the gene as missing (GenEra is not even able to recognize the gene matching itself due to this issue).

May I suggest you replace those \t characters with something else (both in the FASTA file and in the DIAMOND results) and then resume the pipeline?

Cheers, Josué.

glarue commented 1 year ago

Hi @josuebarrera thanks for the response - makes perfect sense.

If it's not too much trouble, might I suggest documenting this somewhere in the GenEra docs? I assume that tabs are the only whitespace character that is disallowed, but because many pipelines process only the beginning of each FASTA header up to the first whitespace character, the types of headers I (admittedly optimistically) used here haven't caused issues in the past. It's probably the case that tabs are unlikely to occur in most normal FASTAs from standardized pipelines, but it would be useful to have it explicitly documented.

Thanks again for your time.

josuebarrera commented 1 year ago

No problem @glarue!

I suspect that escape characters \ are the only ones that could interfere with the GenEra pipeline. As requested, I just added this to the README so the users are aware of the issue:

-q A standard FASTA file containing all the protein sequences of your species of interest. Make sure all your sequence headers are unique and make sure to avoid regular expressions like \t or \n in the sequence headers, as these will interfere with the pipeline.

Please let me know if you stumble into more issues!

Cheers, Josué.

glarue commented 1 year ago

@josuebarrera just a quick addendum: one reason I didn't sanitize the input file initially is because running DIAMOND on such files independently works fine (e.g., running DIAMOND to identify self-hits on the file in my initial example produces output with headers truncated at the first whitespace character)—I assume GenEra is parsing those headers internally before passing them to the alignment pipeline, and something within that process must be changing the way DIAMOND interprets the formatting.

I'm sure for most users this won't be an issue as they will be using more standardized input files, but I thought I'd provide some more context for why this wasn't immediately obvious to me as an issue.

josuebarrera / GenEra

NA for all gene ages #9