Open evan-john opened 7 months ago
I've created a branch with the changes to support the GFF file. Let me know if it doesn't handle the GFF file as you want or if you notice any errors. There may be inconsistencies between the output using the feature table vs. GFF due to redundant information in the GFF.
Major changes:
Managed to compile the new updates, however I'm not sure if it's possible to run locally still, e.g.
orthorefine.exe --input input.txt --OF_file Orthogroups.tsv --window_size 8 --synteny_ratio 0.5 --outfile Orthorefined
where the input.txt is a file that looks like this: G1.gff G2.gff G3.gff ...
and all the corresponding gff files are contained within the executed directory along with 'Orthogroups.tsv' Currently the command returns 'Error feature table file missing'
Drop the file extensions in the input.txt file. OrthoRefine will search for files ending in "feature_table.txt" and then "genomic.gff". I used the RefSeq gff naming file standard. For example, if the local file is "GCF_000009025.1_ASM902v1_genomic.gff", then it needs to be "GCF_000009025.1" in the input.txt.
Thanks for the details, managed to read in the gff files locally now. Also in case others are reading, the files must contain the 'GCF_' prefix in line with RefSeq or won't be recognised.
The final stage seems to be an issue with the gff format itself. I have used FunAnnotate, popular for fungal genomes, to format my GFF annotations. Specifically the format lacks 'locus_tag=' 'gene=' 'product=' 'Name=' etc. Are any/all of these strings strict requirements I could add (and to which feature specifically) to facilitate OrthoRefine interpretation/cross-reference the N0.tsv?
Chr1 GAF gene 1 1413 . - . ID=XXXX_000011 Chr1 GAF mRNA 1 1413 . - . ID=XXXX_000011-T1;Parent=XXXX_000011 Chr1 GAF exon 686 1413 . - . ID=XXXX_000011-T1.exon1;Parent=XXXX_000011-T1; Chr1 GAF exon 466 612 . - . ID=XXXX_000011-T1.exon2;Parent=XXXX_000011-T1; Chr1 GAF exon 1 406 . - . ID=XXXX_000011-T1.exon3;Parent=XXXX_000011-T1; Chr1 GAF CDS 686 1413 . - 0 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1; Chr1 GAF CDS 466 612 . - 1 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1; Chr1 GAF CDS 1 406 . - 1 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1;
When using the protein.faa from NCBI, OrthoFinder outputs the first bit of text from the header lines in the fasta file into N0.tsv, e.g. "WP_000002459.1" - the product accession. Regardless of the source and information in the N0.tsv file, the annotation used by OrthoRefine has to contain the same information to match against. OrthoRefine requires the product accession (Name from .gff) to match against to read in information, the locus tag (locus_tag) information for the analysis, and the gene information (product) to print to the outfile.
If this information is missing, an easy fix it to append the ID info as the locus_tag, product etc. This awk command should do that based on the example.
awk 'match($0,/ID=[^;]*;/) { print $0 "Name" substr($0, RSTART+2, RLENGTH-2) "product" substr($0, RSTART+2, RLENGTH-2) "gene" substr($0, RSTART+2, RLENGTH-2) "locus_tag" substr($0, RSTART+2, RLENGTH-2)}' $file
If the gff file is more than the example above or the format causes additional errors, you can post or email the file (or part of it using the bash head command) and I can write a bash command to change the file or change OrthoRefine depending on the complexity.
That awk one-liner works well thanks.
One final formatting issue I managed to workaround is in the gff filenames. Originally I had multiple underscores. e.g. GCF_XX_ABCD.gff which caused some issues where the wrong file was being read if the substring 'XX' was matching. By editing to GCF_XXABCD.gff (and in corresponding input.txt & N0.tsv headers) seemed to solve the issue.
We are up and running now.
Thank you very much for creating such great software.
When I use a local GFF file as input and run the command "OrthoRefine.exe --input input.txt --OF_file N0.tsv --window_size 8 --synteny_ratio 0.5", the error message
"terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_M_replace_aux
Aborted (core dumped)“ appears.
How can I solve this?
This error is probably due to how OrthoRefine searches for matching file names based on the contents of the input file and the header line from N0.tsv. You can run the bash command below and upload the output file here or send the output to me in an email so I can take a look (the command will print all the current file names in the directory).
cat input.txt > upload_file && head -1 N0.tsv >> upload_file && ls >> upload_file
Thank you so much. The following is the content in upload_file: GCF_Patellavulgata GCF_Physellaacuta HOG OG Gene Tree Parent Clade GCF_Patellavulgata GCF_Physellaacuta GCF_Patellavulgata_feature_table.txt GCF_Physellaacuta_feature_table.txt input.txt N0.tsv upload_file
Are the annotation files GFF or feature table?
The annotation file is GFF. When I use GCF_Patellavulgata.gff, the error "Error feature table file missing" displayed. When I rename GCF_Patellavulgata.gff to GCF_Patellavulgata_feature_table.txt, there is no such error message, but "terminate called after throwing an instance of ' std::length_error' what(): basic_string::_M_replace_aux Aborted (core dumped)" appears.
Try using "genomic.gff" instead of ".gff" as the file extensions. For example, the E. coli's gff file that I use when testing is "GCF_000005845.2_ASM584v2_genomic.gff".
When I use "GCF_Patellavulgata_genomic.gff", the error "Error feature table file missing" appears.
Can you verify that the OrthoRefine.cpp you are compiling from is the gff test branch and not the main? The gff OrthoRefine.cpp will have a copyright of 2024 listed on line 2 instead of 2023 and line 27 will say:
int file_type; // 0 = refseq, 1 = gff
When I use "GCF_Patellavulgata_genomic.gff" with the test branch code, I do not get an error message; however, I get the error message with the main code.
I recompile the OrthoRefine.cpp from the gff test branch and still encounter the error "terminate called after throwing an instance of 'std::length_error' what(): basic_string::_M_replace_aux Aborted (core dumped)"
Use the head command to extract the first 30 lines of one of the gff files so I can check the format.
head -30 $file_genomic.gff
The following is the content of the first 30 lines of the GFF file:
NC_065879.2 RefSeq region 1 95737551 . + . ID=NC_065879.2:1..95737551;Dbxref=taxon:6465;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA NC_065879.2 Gnomon gene 83647 93123 . + . ID=gene-LOC126822048;Dbxref=GeneID:126822048;Name=LOC126822048;description=uncharacterized LOC126822048;gbkey=Gene;gene=LOC126822048;gene_biotype=lncRNA NC_065879.2 Gnomon lnc_RNA 83647 93123 . + . ID=rna-XR_007680621.1;Parent=gene-LOC126822048;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;Name=XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 83647 83878 . + . ID=exon-XR_007680621.1-1;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 88949 89047 . + . ID=exon-XR_007680621.1-2;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 90047 90145 . + . ID=exon-XR_007680621.1-3;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 90308 90379 . + . ID=exon-XR_007680621.1-4;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 90650 90706 . + . ID=exon-XR_007680621.1-5;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 90968 91021 . + . ID=exon-XR_007680621.1-6;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 91259 91321 . + . ID=exon-XR_007680621.1-7;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 91938 92024 . + . ID=exon-XR_007680621.1-8;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 92349 92408 . + . ID=exon-XR_007680621.1-9;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 92657 92687 . + . ID=exon-XR_007680621.1-10;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon exon 92846 93123 . + . ID=exon-XR_007680621.1-11;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 NC_065879.2 Gnomon gene 130262 138460 . + . ID=gene-LOC130014432;Dbxref=GeneID:130014432;Name=LOC130014432;description=uncharacterized LOC130014432;gbkey=Gene;gene=LOC130014432;gene_biotype=lncRNA NC_065879.2 Gnomon lnc_RNA 130262 138460 . + . ID=rna-XR_008789430.1;Parent=gene-LOC130014432;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;Name=XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 130262 130462 . + . ID=exon-XR_008789430.1-1;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 133617 133715 . + . ID=exon-XR_008789430.1-2;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 135433 135504 . + . ID=exon-XR_008789430.1-3;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 135758 135808 . + . ID=exon-XR_008789430.1-4;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 136229 136279 . + . ID=exon-XR_008789430.1-5;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 136516 136578 . + . ID=exon-XR_008789430.1-6;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 136824 136913 . + . ID=exon-XR_008789430.1-7;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 137307 137366 . + . ID=exon-XR_008789430.1-8;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 137977 138007 . + . ID=exon-XR_008789430.1-9;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon exon 138183 138460 . + . ID=exon-XR_008789430.1-10;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 NC_065879.2 Gnomon gene 173822 204032 . - . ID=gene-LOC126818000;Dbxref=GeneID:126818000;Name=LOC126818000;description=serine/threonine-protein kinase 25;gbkey=Gene;gene=LOC126818000;gene_biotype=protein_coding NC_065879.2 Gnomon mRNA 173822 204032 . - . ID=rna-XM_050545848.1;Parent=gene-LOC126818000;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;Name=XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;model_evidence=Supporting evidence includes similarity to: 27 Proteins;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1 NC_065879.2 Gnomon exon 203736 204032 . - . ID=exon-XM_050545848.1-1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1 NC_065879.2 Gnomon exon 198056 198289 . - . ID=exon-XM_050545848.1-2;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1
Does the gff file have header lines at the start of the file, lines that start with the "#"? If so, can you copy and paste at least one of those lines?
gff-version 3
!gff-spec-version 1.21
!processor NCBI annotwriter
!genome-build ASM584v2
!genome-build-accession NCBI_Assembly:GCF_000005845.2
sequence-region NC_000913.3 1 4641652
Does "CDS" ever appear in the 3rd column where "gene", "lnc_RNA", and "exon" appear? If so, can you copy and paste at least one of those lines?
NC_000913.3 RefSeq CDS 190 255 . + 0 ID=cds-NP_414542.1;Parent=gene-b0001;Dbxref=UniProtKB/Swiss-Prot:P0AD86,GenBank:NP_414542.1,ASAP:ABE-0000006,ECOCYC:EG11277,GeneID:944742;Name=NP_414542.1;gbkey=CDS;gene=thrL;locus_tag=b0001;orig_transcript_id=gnl|b0001|mrna.NP_414542;product=thr operon leader peptide;protein_id=NP_414542.1;transl_table=11
Can check with this command
cat GCF_Patellavulgata_genomic.gff | cut -f3 | grep -c "CDS"
It appears the file does not contain "locus_tag=" - which is not a problem. We should be able to use the "ID=gene-" part of the line, but let me know if "locus_tag=" is present or not in the gff.
NC_000913.3 RefSeq CDS 3720049 3720261 . + 0 ID=cds-NP_418012.1;Parent=gene-b3556;Dbxref=UniProtKB/Swiss-Prot:P0A9X9,GenBank:NP_418012.1,ASAP:ABE-0011616,ECOCYC:EG10166,GeneID:948070;Name=NP_418012.1;gbkey=CDS;gene=cspA;locus_tag=b3556;orig_transcript_id=gnl|b3556|mrna.NP_418012;product=cold shock protein CspA;protein_id=NP_418012.1;transl_table=11
Can check with this command
grep -m 10 "locus_tag=" GCF_Patellavulgata_genomic.gff
Header lines that start with the "#" are:
"CDS" appear in the 3rd column: NC_065879.2 Gnomon CDS 203736 203756 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1 NC_065879.2 Gnomon CDS 198056 198289 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1 NC_065879.2 Gnomon CDS 195530 195695 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1
The GFF file does not contain "locus_tag=", but it contains "ID=gene-"
Hello,
I also have a problem with the error : ''Error feature table file missing". I tried to apply all the advices you provided here such as :
-add 'GCF_' prefix and 'genomic.gff' suffix on gff files (example for one of my file : GCF_Homo_sapiens.GRCh38.110_genomic.gff )
-use branch version ('2024 version' according to the second line of the code)
-my input text files is named 'input.txt' and contains gff files name with prefix removed, like this : 'GCF_Homo_sapiens.GRCh38.110'
Finally, that's how my command looks like : $myspace/scripts/all_Orthorefine/OrthoRefine_gff_compatible/orthorefine.exe --input input.txt --OF_file $myspace/data/Orthorefine/test/OrthoFinder/Results_mai30/Phylogenetic_Hierarchical_Orthogroups/N0.tsv --window_size 8 --synteny_ratio 0.5
Thanks in advance for your help.
@SamLlm If the data is publicly available (can be downloaded from NCBI or elsewhere), post the input.txt contents (the genome accessions) or send it to me in an email. If not, let me know and I'll find a way to safely share it. I was trying to troubleshoot this with gff files I obtained from NCBI, but have been unable to recreate the error.
@ZhaoyanZhong I believe I have a fix; however, I need to test it. I'm also working on editing the isoform removal script for gff files as the redundant information will cause OrthoRefine to work less accurately.
@jl02142 yes I would like to send it to you because it is not available on NCBI. Can you provide me an email I could send it to ?
jmludwig@uga.edu
As an update on the "Error feature table file missing",
The file names need a certain format for OrthoRefine to pattern match against. For example, if you are using human and mouse the input file would be:
GCF_Homo.sapiens
GCF_Mus.musculus
and the files in the directory would be named:
GCF_Homo.sapiens_GRCh38.pep.all.fa
GCF_Mus.musculus_GRCm39.pep.all.fa
GCF_Mus.musculus_GRCm39.112.genomic.gff
GCF_Homo.sapiens_GRCh38.112.genomic.gff
The pattern needs to be GCF**.genomic.gff where all the text before the second underscore needs to be the text in the input file.
I've pushed an update to OrthoRefine.cpp on the test branch that will resolve some issues.
If "locus_tag" is not present in GFF file, try "Name" and then try "protein_id" instead. Handle if some fields in the GFF have "cds:" which prevented pattern matching.
Thankyou for building the new tool to complement Orthofinder. The features are currently required in NCBI feature format. Is it possible to implement a feature to use a local GFF file instead as an input? This would be very useful for newly in-house annotated genomes.