Open benyoung93 opened 11 months ago
Been doing a little more sleuthing through log files.
In the funannotate-annotate.bf6cb020.log it shows the XML command that is being run:
/nethome/bdy8/mambaforge/envs/funannotate_env/bin/python /nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/aux_scripts/iprscan2annotations.py /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/iprscan.xml /scratch/projects/omics/ofav_genome/funannotate/step_5_predict/annotate_misc/annotations.iprscan.txt
So interestingly, I went to look at the annotations.iprscan.txt that the output should be written to for this step, and the file is not empty. Here is a head:
Ofa_008232-T1 db_xref InterPro:IPR003968
Ofa_008232-T1 db_xref InterPro:IPR003972
Ofa_008232-T1 db_xref InterPro:IPR000210
Ofa_008232-T1 db_xref InterPro:IPR005821
Ofa_008232-T1 db_xref InterPro:IPR027359
Ofa_008232-T1 db_xref InterPro:IPR003131
Ofa_008232-T1 db_xref InterPro:IPR011333
Ofa_008232-T1 db_xref InterPro:IPR028325
Ofa_008232-T1 go_process potassium ion transport|0006813||IEA
Ofa_008232-T1 go_function channel activity|0015267||IEA
and a wc -l of the file shows 535494 lines.
So I am even more stumped now as to why it is not completing past this step. Maybe it is the last line in the XML file? Here is a tail -20 of the XML file just in case.
</panther-location>
</locations>
</panther-match>
<superfamilyhmmer3-match evalue="5.33E-15">
<signature ac="SSF50978" name="WD40 repeat-like">
<entry ac="IPR036322" desc="WD40-repeat-containing domain superfamily" name="WD40_repeat_dom_sf" type="HOMOLOGOUS_SUPERFAMILY"/>
<signature-library-release library="SUPERFAMILY" version="1.75"/>
</signature>
<model-ac>0049784</model-ac>
<locations>
<superfamilyhmmer3-location hmm-length="340" start="12" end="167">
<location-fragments>
<superfamilyhmmer3-location-fragment start="12" end="167" dc-status="CONTINUOUS"/>
</location-fragments>
</superfamilyhmmer3-location>
</locations>
</superfamilyhmmer3-match>
</matches>
</protein>
</protein-matches>
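One way to rule out a truncated or malformed XML file, rather than eyeballing the tail, is to stream-parse it from start to finish. The following is a minimal sketch using only the Python standard library; the function name check_xml and the tiny example documents are hypothetical and are not part of funannotate or InterProScan.

```python
import io
import xml.etree.ElementTree as ET


def check_xml(source):
    """Stream-parse an XML document (file path or binary file object).

    Returns (True, element_count) when the whole document parses, or
    (False, error_message) when the parser hits malformed or truncated XML.
    """
    count = 0
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            count += 1
            elem.clear()  # release memory; InterProScan XML can be very large
    except ET.ParseError as err:
        return False, str(err)
    return True, count


# Tiny illustrative inputs (not real InterProScan output).
good = io.BytesIO(b"<protein-matches><protein><matches/></protein></protein-matches>")
bad = io.BytesIO(b"<protein-matches><protein><matches/></protein>")  # root never closed

print(check_xml(good))    # (True, 3)
print(check_xml(bad)[0])  # False
```

With a real file, passing the path (for example the iprscan.xml shown in the log) instead of a BytesIO object should work the same way. A clean parse only shows the XML is well-formed, not that every protein ID in it is one that annotate expects.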
Okay so from even more sleuthing, it seems that the actual problem could be the combining of everything. Running the interproscan parsing script seems to go okay even when run independently (i.e. no errors thrown on terminal).
Additionally, when removing the interproscan results completely, the same problem occurs as stated above (the ValueError one).
So yes, still very stumped; any and all help would be extremely appreciated :).
Just some additional notes on my pipeline:
funannotate sort on my scaffolds
pasa run independently on ISO-seq
funannotate predict ran using the hq isoseq transcripts, pasa gff, cleaned and soft masked scaffolded genome
funannotate predict \
--input /scratch/projects/omics/ofav_genome/repeatmasker_soft/ofav_ntlink_clean_sort.fa.masked \
--out /scratch/projects/omics/ofav_genome/funannotate/step_5_predict \
--species "Orbicella faveolata" \
--strain gen_17 \
--name Ofa \
--rna_bam /scratch/projects/omics/ofav_genome/pasa/hq_transcripts_edited.fasta.clean.mm2.bam \
--pasa_gff /scratch/projects/omics/ofav_genome/pasa/ofav_db.sqlite.assemblies.fasta.transdecoder.genome.gff3 \
--organism other \
--repeats2evm \
--transcript_evidence /scratch/projects/omics/ofav_genome/pasa/hq_transcripts_edited.fasta.clean \
--keep_evm \
--optimize_augustus \
--cpus 8
Update ran using some RNA-seq reads and the results from step_5. --name to update my locus tags did not work here. So that's interesting.
funannotate update \
-i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict \
--cpus 8 \
--left samp1_r1.fastq samp2_r1.fastq samp3_r1.fastq samp4_r1.fastq samp5_r1.fastq samp6_r1.fastq samp7_r1.fastq samp8_r1.fastq samp9_r1.fastq \
--right samp1_r2.fastq samp2_r2.fastq samp3_r2.fastq samp4_r2.fastq samp5_r2.fastq samp6_r2.fastq samp7_r2.fastq samp8_r2.fastq samp9_r2.fastq \
--memory 100G \
--pacbio_isoseq ../../pasa/hq_transcripts_edited.fasta \
--name QW917 \
--species "Orbicella faveolata" \
--strain gen_17 \
--out /scratch/projects/omics/ofav_genome/funannotate/step_6_update
interproscan run on the proteins.fa from update (in the update_results folder).
annotate using results from update and predict (the step with some problems :( )
funannotate annotate \
-i /scratch/projects/omics/ofav_genome/funannotate/step_5_predict \
--cpus 10 \
--iprscan /scratch/projects/omics/ofav_genome/funannotate/interproscan_res/Orbicella_faveolata_gen_17.proteins.fa.xml \
--species "Orbicella faveolata" \
--strain gen_17 \
--out /scratch/projects/omics/ofav_genome/funannotate/step_7_annotate \
--rename QW917
Been doing some more checking of my files for all of this. Could it be the input file I used for interproscan?
I used the proteins.fasta from update in interproscan, but doing some greps I notice that some of the sequences in the mRNA fasta are not in the proteins.fasta. Thought I would report this as it may be useful information.
While I wait for help I am running the mrna.fasta from update in interproscan so I can then see if that one works in annotate.
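The comparison between the mRNA and protein FASTA files can also be done as a set difference on the sequence IDs instead of repeated greps. This is only a sketch: the helper names fasta_ids and missing_ids are made up for illustration, the records are invented, and it assumes the ID is the first whitespace-delimited token of each header.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first token of each '>' header line)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def missing_ids(mrna_lines, protein_lines):
    """IDs present in the mRNA FASTA but absent from the protein FASTA, sorted."""
    return sorted(fasta_ids(mrna_lines) - fasta_ids(protein_lines))


# Illustrative records only (not taken from the real files).
mrna = [">Ofa_000001-T1 Ofa_000001", "ATGC", ">Ofa_000002-T1 Ofa_000002", "ATGA"]
prot = [">Ofa_000001-T1 Ofa_000001", "MK"]

print(missing_ids(mrna, prot))  # ['Ofa_000002-T1']
```

Both functions accept any iterable of lines, so open file handles for the two FASTA files can be passed in directly.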
Okay, update number 3 (should I be combining all of these into the top post?).
I did an awk and found that there is a miscreant locus tag:
awk '/^>[^[:space:]]+[[:space:]][^[:space:]]+[[:space:]][^[:space:]]+/'
>Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2
Doing a grep of novel, I actually have 2 gene names that are not labelled properly. Saying that, it's probably (?) the one with the whitespaces that is causing the problem.
>Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2
>Ofa_030816_novel_gene_620_64b6f8d2-T1 Ofa_030816_novel_gene_620_64b6f8d2
So then I wondered where this happened. Going into my predict_results and running grep "novel" file shows nothing aberrant. Even using the Ofa_031543 and Ofa_030816 shows that these are formatted properly here. BUT, going into the update_results and running grep "novel" file shows that these genes now have novel within them and all the other stuff as well. Hmmmmmm. This is in most of the update files.
Is there any idea why this would be happening? Just a parsing issue at some point in the update command??
I'm also wondering if there is a proper way to fix this rather than going into all the files and fixing it manually, or doing some awks and seds (although that does instill a silent terror in my bones). I've looked at fix but am a little confused on the usage.
Thank you for any and all advice on this :).
Hi @benyoung93. Sorry you ran into this problem. And you've done a great sleuthing job already. Indeed the error appears to be in parsing the gene names, and for some reason it seems that a handful of gene models from PASA update (funannotate update) were not processed/renamed properly, which is then causing a problem trying to parse their names.
The simplest fix would be to just delete those two problematic gene models (you could do this with funannotate fix). If, however, you think they are real then you probably don't want to do that. But based on the names that PASA assigned (they are kind of crazy), they appear to be novel genes that it thinks exist based on the transcriptome data. So ideally you'd want the names to follow a numeric progression (ie 00001 --> 00002 --> 00003, etc) as you move along the chromosomes/scaffolds/contigs. However, this isn't necessary at all. So you could also just rename those two gene models with new unique numbers, ie add 1 to the largest locus tag number you have.
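The "add 1 to the largest locus tag number" suggestion can be sketched roughly as follows. This is an illustration only: next_locus_tag is a hypothetical helper, not a funannotate function, and it assumes tags look like a prefix, an underscore and a zero-padded number, optionally followed by a suffix such as -T1.

```python
import re


def next_locus_tag(ids, prefix="Ofa"):
    """Return the next unused locus tag, one above the largest number seen.

    Only IDs starting with <prefix>_<digits> are considered; the zero-padding
    width of the largest existing tag is preserved.
    """
    pattern = re.compile(r"^%s_(\d+)" % re.escape(prefix))
    best = None
    for ident in ids:
        match = pattern.match(ident)
        if match and (best is None or int(match.group(1)) > int(best)):
            best = match.group(1)
    if best is None:
        raise ValueError("no locus tags found with prefix %r" % prefix)
    return "%s_%0*d" % (prefix, len(best), int(best) + 1)


# Illustrative IDs; in practice these would be collected from the annotation files.
existing = ["Ofa_000001", "Ofa_031543-T1", "Ofa_030816"]
print(next_locus_tag(existing))  # Ofa_031544
```

Collecting the IDs from the GFF3 or TBL file first and then calling the helper once per model to rename keeps the new tags unique, provided each newly assigned tag is added back to the list before the next call.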
And you bring up a challenging point about how to do this, which files, etc. This would appear somewhat tricky, but the good news is that the final annotation files will be the output of funannotate annotate, so you just need to fix the files in the update_results folder that annotate will use. So the files to "fix" here are the Genbank, GFF3, and TBL files in the update_results folder -- you just need to fix the IDs for those two gene models. Those three files will get used in annotate to generate the necessary files for adding functional annotation.
It would be nice to figure out why those models from update were not renamed so we could fix the bug -- the other "fix" would be to get the code updated so that you could re-run update and it would hopefully rename the models properly. Would you be able to share the update_misc/bestmodels.gff3? I maybe don't need the entire file, but certainly need to understand what the naming of these problematic genes looks like in that file. You can email it to me if that is easier. I think the error must be in this region and likely just a format that I've never seen before: https://github.com/nextgenusfs/funannotate/blob/master/funannotate/update.py#L1748-L1761
Good morning @nextgenusfs :).
Thank you for the response :). To answer some queries
So ideally you'd want the names to follow a numeric progression (ie 00001 --> 00002 --> 00003, etc) as you move along the chromosomes/scaffolds/contigs.
This is interesting because these weird names are actually in the correct place I think (i.e. Ofa_030816_novel_gene_620_64b6f8d2-T1 Ofa_030816_novel_gene_620_64b6f8d2 is next to Ofa_030815 and Ofa_030817 and on the correct contig, ofavscaf_14). These Ofa_xxxx are then not repeated anywhere, so I just deleted the craziness and kept them as Ofa_XXXXX, making sure to be consistent and adding in the relevant -T1 when needed.
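The manual clean-up described here (drop everything after the locus tag, keep -T1 where present) could be scripted along these lines. This is a hedged sketch, not a funannotate feature: the regular expression assumes the malformed IDs always look like the two headers shown earlier in the thread, and clean_ids is a hypothetical helper.

```python
import re

# Assumed shape of the malformed IDs, based on the two headers in this thread:
# a normal locus tag followed by one or more "_novel_gene_<n>_<hex>" suffixes,
# possibly with a stray space or tab before the number.
MALFORMED = re.compile(r"(Ofa_\d+)(?:_novel_gene_[ \t]*\d+_[0-9a-f]+)+")


def clean_ids(text):
    """Strip the PASA-style suffixes, keeping only the leading locus tag."""
    return MALFORMED.sub(r"\1", text)


header = (">Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 "
          "Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2")
print(clean_ids(header))  # >Ofa_031543-T1 Ofa_031543
```

IDs that start with novel rather than the locus tag prefix are left untouched by this pattern. It is worth running it on copies of the Genbank, GFF3 and TBL files and confirming that the cleaned tags do not collide with existing ones before swapping the files in.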
So the files to "fix" here are the Genbank, GFF3, and TBL files in the update_results folder -- you just need to fix the IDs for those two gene models. Those three files will get used in annotate to generate the necessary files for adding functional annotation.
Yep, this is exactly what I did. I downloaded, fixed in atom, and then reuploaded (while also tar gunzipping the original files to have a record of this). Interproscan has just finished running and I will be setting off annotate momentarily to see if this fix worked.
It would be nice to figure out why those models from update were not renamed so we could fix the bug -- that would be the other "fix" is if we can get the code updated and you could re-run update it would hopefully rename the models properly. Would you be able to share the update_misc/bestmodels.gff3
I would be more than happy to send this to you :). One query I have here is that I was not able to get mysql onto our HPC cluster (I tried so many things to get it installed, proper channels as well as hacky ways lol), so I had to get the walltime extended for the update command. It ran in 7 days. I think that I can re-run the update command and, as I have all the files, it will just re-parse and combine everything (?) so I do not have to do that 7 day wait again. Please let me know if that is right/wrong.
I have sent the files, but I think you may be right. Looking at the bestmodels.gff3, those two genes are the only ones with the Ofa_xxx and then all the novel after them (ID=Ofa_030816_novel_gene_620_64b6f8d2). All other instances of novel (from my quick skim through) do not have the Ofa_xxx but instead the gene name starts with novel (e.g. ID=novel_gene_1388_64b6f8d2_novel_gene_1442_64b6f8d2). So I think the section of code you identified is the right one :).
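The quick skim through the GFF3 could be made systematic by grouping every ID that contains novel according to whether it starts with the locus tag prefix. The sketch below assumes standard tab-separated GFF3 with an ID= attribute in column 9; classify_novel_ids is a hypothetical helper, and the coordinates and feature types in the example lines are invented.

```python
import re
from collections import defaultdict

# Capture the ID attribute up to the next ';' (spaces inside the ID are kept).
ID_ATTR = re.compile(r"(?:^|;)ID=([^;]+)")


def classify_novel_ids(gff_lines, prefix="Ofa"):
    """Group GFF3 IDs containing 'novel' by whether they start with the prefix."""
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = ID_ATTR.search(cols[8])
        if not match or "novel" not in match.group(1):
            continue
        ident = match.group(1)
        key = "prefixed" if ident.startswith(prefix + "_") else "novel_only"
        groups[key].append(ident)
    return dict(groups)


# Invented example lines; only the ID values come from the thread.
lines = [
    "##gff-version 3",
    "ofavscaf_14\t.\tgene\t100\t900\t.\t+\t.\tID=Ofa_030816_novel_gene_620_64b6f8d2",
    "ofavscaf_14\t.\tgene\t950\t1800\t.\t+\t.\tID=novel_gene_1388_64b6f8d2_novel_gene_1442_64b6f8d2",
    "ofavscaf_14\t.\tgene\t2000\t2500\t.\t-\t.\tID=Ofa_030817",
]
print(classify_novel_ids(lines))
```

If only the two known genes land in the prefixed group, that would support the idea that the mixed "locus tag plus novel" naming is the format the renaming code does not handle.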
Thank you for all the help, I will update here if fixing those names allows annotate to complete successfully.
Ben
Also, quick query/enhancement. How possible would it be, if you get the error I did (copied below), to have some sort of awk/sed/grep that prints out the offending locus tags/IDs when merging?
It may not be needed, as I have not seen any other issues like this and mine may be unique (?), but it could be a nice addition.
Traceback (most recent call last):
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/bin/funannotate", line 10, in <module>
    sys.exit(main())
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/funannotate.py", line 716, in main
    mod.main(arguments)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/annotate.py", line 1458, in main
    GeneNames = lib.getGeneBasename(Proteins)
  File "/nethome/bdy8/mambaforge/envs/funannotate_env/lib/python3.8/site-packages/funannotate/library.py", line 1081, in getGeneBasename
    transcript, gene = line.split(" ")
ValueError: too many values to unpack (expected 2)
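The failing statement in the traceback, transcript, gene = line.split(" "), needs exactly two space-separated fields, so a header with a space inside the ID breaks it. Something along the following lines could report the offending headers. It is a standalone sketch with hypothetical helper names (parse_header, bad_headers), not the code in funannotate's library.py.

```python
def parse_header(line):
    """Split a 'transcript gene' header, naming the offending line on failure."""
    fields = line.lstrip(">").rstrip("\n").split(" ")
    if len(fields) != 2:
        raise ValueError(
            "expected 'transcript gene' but found %d fields in header: %r"
            % (len(fields), line)
        )
    return fields[0], fields[1]


def bad_headers(lines):
    """Return every FASTA header line that parse_header cannot handle."""
    bad = []
    for line in lines:
        if not line.startswith(">"):
            continue
        try:
            parse_header(line)
        except ValueError:
            bad.append(line.rstrip("\n"))
    return bad


# The two headers reported earlier in this thread, plus a dummy sequence line.
headers = [
    ">Ofa_030816_novel_gene_620_64b6f8d2-T1 Ofa_030816_novel_gene_620_64b6f8d2",
    ">Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2-T1 "
    "Ofa_031543_novel_gene_672_64b6f8d2_novel_gene_ 673_64b6f8d2",
    "MKV",
]
for h in bad_headers(headers):
    print(h)
```

On these inputs only the Ofa_031543 header is reported, since it splits into four fields, which is consistent with the guess that the ID containing whitespace is the one triggering the ValueError.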
Can confirm that once fixing those 2 genes in the update step, annotate successfully runs and I have all my files wooooooooooooooooooooo.
Let me know if you need any more information re the naming bug and I can provide it. I will leave this open until we get to the bottom of that :).
Ben
Are you using the latest release? Yep :)
Describe the bug I get the following error message (in logfile section below) when running the annotate step, it is linked to the XML file from InterproScan that I ran locally. Version of interproscan is
Here is a head -20 of the xml file from interproscan. I used the proteins.fa from funannotate update as input for interproscan.
Here is a grep ">" proteins.fa | head -20 of the proteins fasta from the update command.
and a head .gff3 from funannotate::update
What command did you issue?
Logfiles To try and be a wee bit more concise I did not put the whole log file in, just a few lines above where the error comes in, in my LSF .err file
and here is the entirety of the log file from the annotate command. I notice I was stupid and need to reset the busco, will do that in a later run once this error is fixed.
OS/Install Information
funannotate check --show-versions
You are running Perl v b'5.032001'. Now checking perl modules...
Carp: 1.50
Clone: 0.46
DBD::SQLite: 1.72
DBD::mysql: 4.046
DBI: 1.643
DB_File: 1.858
Data::Dumper: 2.183
File::Basename: 2.85
File::Which: 1.24
Getopt::Long: 2.54
Hash::Merge: 0.302
JSON: 4.10
LWP::UserAgent: 6.67
Logger::Simple: 2.0
POSIX: 1.94
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.14
Tie::File: 1.06
URI::Escape: 5.17
YAML: 1.30
local::lib: 2.000029
threads: 2.25
threads::shared: 1.61
All 27 Perl modules installed
Checking Environmental Variables...
$FUNANNOTATE_DB=/scratch/projects/omics/ofav_genome/funannotate_db
$PASAHOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/pasa-2.5.2
$TRINITY_HOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/trinity-2.8.5
$EVM_HOME=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/nethome/bdy8/mambaforge/envs/funannotate_env/config/
$GENEMARK_PATH=/nethome/bdy8/mambaforge/envs/funannotate_env/opt/gmes_linux_64
All 6 environmental variables are set
Checking external dependencies...
PASA: 2.5.2
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.5.0
bamtools: bamtools 2.5.1
bedtools: bedtools v2.31.0
blat: BLAT v37x1
diamond: 2.1.7
emapper.py: 2.1.11
ete3: 3.1.2
exonerate: exonerate 2.4.0
fasta: 36.3.8g
glimmerhmm: 3.0.4
gmap: 2023-04-28
hisat2: 2.2.1
hmmscan: HMMER 3.3.2 (Nov 2020)
hmmsearch: HMMER 3.3.2 (Nov 2020)
java: 17.0.3-internal
kallisto: 0.46.1
mafft: v7.520 (2023/Mar/22)
makeblastdb: makeblastdb 2.14.0+
minimap2: 2.26-r1175
pigz: 2.6
proteinortho: 6.2.3
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.16.1
snap: 2006-07-28
stringtie: 2.2.1
tRNAscan-SE: 2.0.6 (May 2020)
tantan: tantan 40
tbl2asn: 25.8
tblastn: tblastn 2.14.0+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
ERROR: gmes_petap.pl not installed
ERROR: signalp not installed