Closed ferninfm closed 2 years ago
Those are long protein names - I don't know if that is a problem, but usually the IDs will be just GT1_008627-T1 or JCM9580_007 - it shouldn't necessarily break the parser, but that looks odd to me.
Dear Jason, I see your point. I tried to batch reprocess some published genomes without losing information and was obviously too generous with the names. I can imagine the point in GCA001599015.1 could be a source of problems. Let's see if I can solve this today.
Fernando
So I renamed the proteins using sed in all files, then repeated annotate, and the error remains.
Somehow funannotate annotate degrades the sequence names while copying the fasta files from predict_results to annotate_misc. A space is inserted within the fasta name string:
==> Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/predict_results/GCA_001599015.1_Apiotrichum_domesticum_JCM_9580.proteins.fa <==
>GCA001599015.1ApiotrichumdomesticumJCM9580_000001-T1 GCA001599015.1ApiotrichumdomesticumJCM9580_000001
==> Apiotrichum_domesticum_JCM_9580_GCA_001599015.1/annotate_misc/genome.proteins.fasta <==
>GCA001599015.1ApiotrichumdomesticumJCM9580_000 001-T1 GCA001599015.1ApiotrichumdomesticumJCM9580_000 001
This happens irrespective of name length:
==> predict_results/GCA_002973495.1_Apiotrichum_akiyoshidainum_HP2023.mrna-transcripts.fa <==
>GCA002973495_000001-T1 GCA002973495_000001
==> annotate_misc/genome.transcripts.fasta <==
>GCA002973495_ 000001-T1 GCA002973495_ 000001
This damages both the PFam and dbCAN result files, although the first does not report an error. I checked and never had it before this time, but I can reproduce it in two different installations of funannotate (1.7.4 and latest) on two operating systems. So it may be my script (?)
I reconstituted the transcript and protein files by hand, using sed with 's/ //g' to strip the spaces, and tried to run without --force, but the files are rewritten anyway. The spaces fall in the same place within each naming convention, but in different places depending on the length of the name (long name 9 characters from the end, short name 7 characters...)
I am very puzzled by this
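One way to pin down exactly which headers change between predict_results and annotate_misc is to compare them pairwise. This is a minimal sketch in plain Python, assuming both files list their records in the same order; the function names are hypothetical and not part of funannotate.

```python
def fasta_headers(text):
    """Return the header lines (without the leading '>') from FASTA text."""
    return [line[1:].rstrip() for line in text.splitlines() if line.startswith(">")]


def changed_headers(before_text, after_text):
    """Pair headers by position and return the (before, after) pairs that differ."""
    before = fasta_headers(before_text)
    after = fasta_headers(after_text)
    return [(b, a) for b, a in zip(before, after) if b != a]


# Headers copied from the two files shown above; the sequences are placeholders.
predict = ">GCA002973495_000001-T1 GCA002973495_000001\nATG\n"
annotate = ">GCA002973495_ 000001-T1 GCA002973495_ 000001\nATG\n"

mismatches = changed_headers(predict, annotate)
for b, a in mismatches:
    print(repr(b), "->", repr(a))
```

In practice the two strings would be read from the files with `open(path).read()`.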
I'm not sure what is going on here -- but certainly the names are too long to be parsed by the biopython HMM parser, which is why it is failing. I would delete the results from funannotate annotate and then re-run funannotate predict with something more logical in the --name field, i.e. if you wanted to use the strain isolate name, this would work: --name HP2023 -- this should re-use the existing data you have for predict but generate better gene model names.
It seems you must have run --name GCA001599015.1ApiotrichumdomesticumJCM9580 on the first pass? I didn't have a length limit on the name variable as I never imagined it would be a problem -- it's designed to use the NCBI locus_tag identifier, which is at least 3 alphanumeric characters (for genome submissions it is assigned by NCBI if you don't provide your own). The proposal is here
The locus_tag prefix can contain only alpha-numeric characters and it must be at least 3 characters long. It should start with a letter, but numerals can be in the 2nd position or later in the string (ex. A1C). There should be no symbols, such as -_*, in the prefix. The locus_tag prefix is to be separated from the tag value by an underscore ‘_’, e.g. A1C_00001.
Most commonly now it is 4 or 5 characters, AB01.
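The rule quoted above can be written as a pair of regular expressions. This is a minimal sketch, not funannotate's own check; the helper names are hypothetical, and treating the tag value as digits only is an assumption based on the A1C_00001 example.

```python
import re

# Prefix: starts with a letter, alphanumeric only, at least 3 characters.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,}")
# Full locus tag: prefix, exactly one underscore, then the tag value.
# Assumption: the tag value is made of digits only.
TAG_RE = re.compile(r"([A-Za-z][A-Za-z0-9]{2,})_(\d+)")


def is_valid_prefix(prefix):
    """True if the string satisfies the quoted locus_tag prefix rule."""
    return PREFIX_RE.fullmatch(prefix) is not None


def split_locus_tag(tag):
    """Return (prefix, value) for a well-formed locus tag, else None."""
    match = TAG_RE.fullmatch(tag)
    return match.groups() if match else None


for candidate in ["A1C", "HP2023", "GCA001599015.1ApiotrichumdomesticumJCM9580"]:
    print(candidate, is_valid_prefix(candidate))
```

The long name from this thread fails the prefix check because of the dot in GCA001599015.1, not because of its length.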
Hi Jon,
Thanks for clarifying the locus_tag ideas.
For the record, the length of the locus tag is not the problem. I erased the annotate results and reran, and still when it runs it inserts spaces within the names in the fasta mrna.transcript and protein files, which is what causes the parsing error.
As for the long locus_tag, it is not mine. It was created by predict because I specified --species, --strain and --isolate, but forgot to specify the locus tag with --name, so they all got pasted together as the locus tag (without underscores).
I did rerun predict with --name and repeated annotate. Somehow the files are not altered that way, although the files in predict_results look identical....
Thanks
The default for the --name parameter in predict is FUN, so something else must have happened during the first run, perhaps a mistake in a wrapper script or something of that nature.
The fasta headers are printed with a space as it is a common format, ">transcriptid genename". Biopython will parse this as rec.id == transcriptid, with the gene name carried in rec.description.
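The effect of an extra space on this convention can be shown without Biopython. The sketch below only mimics the split-on-first-whitespace behaviour described above; `split_header` is a hypothetical helper, not a Biopython or funannotate function.

```python
def split_header(header_line):
    """Split a FASTA header at the first whitespace into (id, rest)."""
    parts = header_line.lstrip(">").strip().split(None, 1)
    record_id = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return record_id, rest


# Header as written in predict_results
good = split_header(">GCA002973495_000001-T1 GCA002973495_000001")
# Header as found in annotate_misc, with the extra spaces
bad = split_header(">GCA002973495_ 000001-T1 GCA002973495_ 000001")

print(good)
print(bad)
```

With the extra space, everything after the underscore moves out of the ID, so every record from that genome would end up with the same truncated ID.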
Hi, my mistake: I did in fact specify the long --name in the batch submission, you were right all along... I mean, this has become an absolutely irrelevant issue, more to do with me trying to batch submit to the cluster instead of specifying the parameters one by one... However
That the fasta headers are printed with a space is clear, I have eyes to see. The question is how and why are additional systematically placed spaces appearing within the transcript and genenames, not between them. Spaces that otherwise did not exist in the predict_results but are generated when copying/parsing those files into the annotate_misc input files.
I think that behaviour may be caused by having underscores in the name? Anyway. Irrelevant issue.
Thanks for the help, I was stuck trying to troubleshoot a simple mistake.
Take care
Yes, I think you are right that the underscores are problematic for the locus tag. This is because funannotate assumes each locus tag will have one underscore between the locus tag prefix and the numerical portion -- this is defined in NCBI's docs, so that's why it uses that convention.
Hi, I have run out of ideas on how to solve this.
Problem: funannotate annotate breaks in the dbCAN stage.
For each run it breaks at a different ID.
What I did:
0) Upgraded to the latest github version.
1) I went bottom up and tested multiple things (see below), which were not useful.
2) The analyses do run: dbCAN.txt has been populated (--cpu 1) but dbCAN.filtered.txt has not (it only has a header).
3) With partitioning (--cpu 32), the Chunk_*.dbcan.txt files have been populated, and so has dbCAN.txt, but dbCAN.filtered.txt has not (it only has a header).
4) The problem has to do with parsing using the SearchIO.parse module of biopython (Line 120).
5) Head of dbCAN.txt
6) Head of a different analysis that worked, format has not changed
7) Upgrading to biopython 1.79 did not change the behaviour and gave a compatibility warning with eggnog mapper (requires biopython v1.76).
8) Downgrading to 1.76 also did not change things.
I also checked other possibilities:
1) I updated the dbCAN database using funannotate.
2) I checked dbCAN.hmm and it had no duplicated hmms.
3) I erased the files produced by hmmpress and rebuilt them (then realised this has no influence whatsoever).
4) I ran "hmmscan --domtblout eraseme_outfiles2 --cpu 12 -E 1e-17 $FUNANNOTATE_DB/dbCAN.hmm $FILE" as found in line 87 of hmmer_parallel.py. It runs without a problem.
5) I thought the problem was in line 193, where splitProts is generated (to parallelize?), but when I rerun annotate toggling --cpus 1 the problem persists.
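Since hmmscan itself runs cleanly, one further check that does not depend on biopython is to scan the --domtblout table for rows whose numeric columns are not numeric. This is a rough sketch: the column positions follow HMMER's documented domtblout layout (22 fixed columns plus a free-text description), `suspicious_rows` is a hypothetical helper, and the sample rows are made up for illustration. Nothing in this thread establishes that malformed rows are the cause here.

```python
# 0-based positions of integer columns in a domtblout row:
# tlen, qlen, domain #, domain count, hmm from/to, ali from/to, env from/to
INT_COLUMNS = (2, 5, 9, 10, 15, 16, 17, 18, 19, 20)
MIN_FIELDS = 22  # fixed columns before the description


def suspicious_rows(domtbl_text):
    """Return (line number, reason) for rows that do not look like domtblout."""
    bad = []
    for lineno, line in enumerate(domtbl_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < MIN_FIELDS:
            bad.append((lineno, "too few fields"))
        elif not all(fields[i].isdigit() for i in INT_COLUMNS):
            bad.append((lineno, "non-integer length or coordinate"))
    return bad


# Made-up rows: the second data row has a stray token after the query name.
sample = "\n".join([
    "# target name accession tlen query name ...",
    "GH5.hmm - 300 prot1-T1 - 450 1e-50 170.0 0.1 1 1 1e-53 1e-50 169.5 0.1 5 290 20 310 15 320 0.95 -",
    "GH5.hmm - 300 GCA_ 000001-T1 - 450 1e-50 170.0 0.1 1 1 1e-53 1e-50 169.5 0.1 5 290 20 310 15 320 0.95 -",
])
report = suspicious_rows(sample)
print(report)
```

Running this over dbCAN.txt and over the file from the analysis that worked would show whether the two tables really have the same shape.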
Complete error output
funannotate check