Closed glarue closed 1 year ago
Hello, @glarue!
It seems that the DIAMOND step ran successfully. I think the problem here is related to the complex names that were given to the Arabidopsis proteins (e.g., AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+\t13049164:13049433\t90\t13049164-13049433). Specifically, I think the \t characters are being interpreted as tabs by GenEra and that is messing up the parsing process of the DIAMOND table. The DIAMOND table is separated by tabs, so when GenEra tries to identify the value in column 5 (taxid), it is instead finding part of the gene name (e.g., 13049164:13049433). Since the pipeline cannot associate this string with any taxid in the NCBI taxonomy, it registers the gene as missing (GenEra is not even able to recognize the gene matching itself due to this issue).
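To illustrate the column shift described above, here is a minimal Python sketch. It is not GenEra's actual parsing code. The five-column row layout (query, subject, e-value, bit score, taxid), the `subject_1` name, and the helper names `make_row` and `column` are assumptions made for illustration; only the header string and "taxid in column 5" come from this thread.

```python
# Illustrative only: a hypothetical 5-column, tab-separated DIAMOND-style row.
# The real column layout used by GenEra is not shown in this thread.

def make_row(query: str) -> str:
    """Build a hypothetical row: query, subject, e-value, bit score, taxid."""
    return "\t".join([query, "subject_1", "1e-50", "200", "3702"])

def column(row: str, index: int) -> str:
    """Return the 1-based column of a tab-separated row."""
    return row.rstrip("\n").split("\t")[index - 1]

clean_query = "AraTha-rna-NM_001084197.2"
# Here "\t" is a real tab character inside the sequence name.
tabbed_query = (
    "AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+"
    "\t13049164:13049433\t90\t13049164-13049433"
)

taxid_clean = column(make_row(clean_query), 5)    # the intended taxid field
taxid_tabbed = column(make_row(tabbed_query), 5)  # a fragment of the name instead

print(taxid_clean)
print(taxid_tabbed)
```

With the tab-containing name, the extra tabs push every real column to the right, so the fifth field is a piece of the header rather than a taxid.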
May I suggest you replace those \t characters with something else (both in the FASTA file and in the DIAMOND results) and then resume the pipeline?
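One way to do the replacement in the FASTA file is sketched below in Python. This is only a sketch under assumptions: the function names `sanitize_header` and `sanitize_fasta` and the underscore replacement are hypothetical choices, and since it is not clear from the thread whether the headers hold real tab characters or the literal two-character sequence `\t`, the sketch replaces both. The sequence names in the DIAMOND results would need the same substitution so that they still match the FASTA headers.

```python
from typing import Iterable, Iterator

def sanitize_header(line: str, replacement: str = "_") -> str:
    """Replace tabs and literal backslash escapes in a FASTA header line.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    body = line.rstrip("\n")
    # Literal two-character sequences: backslash followed by 't' or 'n'.
    body = body.replace("\\t", replacement).replace("\\n", replacement)
    # Real tab characters.
    body = body.replace("\t", replacement)
    return body + "\n"

def sanitize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines of a FASTA file with sanitized headers."""
    for line in lines:
        yield sanitize_header(line)

example = [">AraTha-rna-NM_001084197.2\tgene-AT1G35467\n", "MKV\n"]
print("".join(sanitize_fasta(example)), end="")
```

To process a file, the lines of an open input handle can be passed to `sanitize_fasta` and the result written to a new file, leaving the original untouched.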
Cheers, Josué.
Hi @josuebarrera thanks for the response - makes perfect sense.
If it's not too much trouble, might I suggest documenting this somewhere in the GenEra docs? I assume that tabs are the only whitespace character that is disallowed, but because many pipelines process only the beginning of each FASTA header up to the first whitespace character, the types of headers I (admittedly optimistically) used here haven't caused issues in the past. It's probably the case that tabs are unlikely to occur in most normal FASTAs from standardized pipelines, but it would be useful to have it explicitly documented.
Thanks again for your time.
No problem @glarue!
I suspect that escape characters \ are the only ones that could interfere with the GenEra pipeline.
As requested, I just added this to the README so the users are aware of the issue:
-q  A standard FASTA file containing all the protein sequences of your species of interest. Make sure all your sequence headers are unique and make sure to avoid regular expressions like \t or \n in the sequence headers, as these will interfere with the pipeline.
Please let me know if you stumble into more issues!
Cheers, Josué.
@josuebarrera just a quick addendum: one reason I didn't sanitize the input file initially is because running DIAMOND on such files independently works fine (e.g., running DIAMOND to identify self-hits on the file in my initial example produces output with headers truncated at the first whitespace character). I assume GenEra is parsing those headers internally before passing them to the alignment pipeline, and something within that process must be changing the way DIAMOND interprets the formatting.
I'm sure for most users this won't be an issue as they will be using more standardized input files, but I thought I'd provide some more context for why this wasn't immediately obvious to me as an issue.
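For context, the "truncate at the first whitespace character" convention mentioned above can be sketched in Python as follows. The function name `first_token` is hypothetical, and this only mimics the convention; it is not code from DIAMOND or GenEra.

```python
def first_token(header: str) -> str:
    """Return the sequence ID: the header text up to the first whitespace."""
    text = header.lstrip(">").strip()
    parts = text.split(maxsplit=1)  # splits on any whitespace, including tabs
    return parts[0] if parts else ""

header = ">AraTha-rna-NM_001084197.2\tgene-AT1G35467\tNC_003070.9\t+"
seq_id = first_token(header)
print(seq_id)
```

A tool that keeps only this first token never sees the tabs later in the header, which would explain why such files cause no trouble when the full header is not carried into a tab-separated table.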
Hi again @josuebarrera!
I'm trying to run GenEra on a longest-isoform proteome of Arabidopsis, and while the DIAMOND (v2.1.6) search step seems to complete successfully, the final output has NA for all genes in gene_ages.tsv. I believe there's a known issue with DIAMOND failing to propagate scientific names correctly during DB construction, but the taxonomy IDs are present and the output file contains the taxonomy IDs as expected.
Furthermore, I've spot-checked the DIAMOND output manually and in those cases, results for the genes assigned NA are present in the bout file, e.g.:
Any idea what might be going on? Thanks!