evan-john commented 7 months ago

Thankyou for building the new tool to complement Orthofinder. The features are currently required in NCBI feature format. Is it possible to implement a feature to use a local GFF file instead as an input? This would be very useful for newly in-house annotated genomes.

jl02142 commented 7 months ago

I've created a branch with the changes to support the GFF file. Let me know if it doesn't handle the GFF file as you want or if you notice any errors. There may be inconsistencies between the output using the feature table vs. GFF due to redundant information in the GFF.

Major changes:

download_ft_fafiles.sh updated to download GFF files from NCBI instead of feature table files using the -gff option as the second command-line parameter.
master_OrthoRefine.sh updated to pass the -gff option to download_ft_fafiles.sh.
OrthoRefine.cpp updated to search for a feature table file first per genome and then search for a gff file. Users can mix feature table files and GFF files between genomes; if a feature table file and GFF file exists for the same genome, the feature table file will be used.

evan-john commented 6 months ago

Managed to compile the new updates, however I'm not sure if it's possible to run locally still, e.g.

orthorefine.exe --input input.txt --OF_file Orthogroups.tsv --window_size 8 --synteny_ratio 0.5 --outfile Orthorefined

where the input.txt is a file that looks like this: G1.gff G2.gff G3.gff ...

and all the corresponding gff files are contained within the executed directory along with 'Orthogroups.tsv' Currently the command returns 'Error feature table file missing'

jl02142 commented 6 months ago

Drop the file extensions in the input.txt file. OrthoRefine will search for files ending in "feature_table.txt" and then "genomic.gff". I used the RefSeq gff naming file standard. For example, if the local file is "GCF_000009025.1_ASM902v1_genomic.gff", then it needs to be "GCF_000009025.1" in the input.txt.

evan-john commented 6 months ago

Thanks for the details, managed to read in the gff files locally now. Also in case others are reading, the files must contain the 'GCF_' prefix in line with RefSeq or won't be recognised.

The final stage seems to be an issue with the gff format itself. I have used FunAnnotate, popular for fungal genomes, to format my GFF annotations. Specifically the format lacks 'locus_tag=' 'gene=' 'product=' 'Name=' etc. Are any/all of these strings strict requirements I could add (and to which feature specifically) to facilitate OrthoRefine interpretation/cross-reference the N0.tsv?

Example gff from FunAnnotate funannotate util gff-rename

Chr1 GAF gene 1 1413 . - . ID=XXXX_000011 Chr1 GAF mRNA 1 1413 . - . ID=XXXX_000011-T1;Parent=XXXX_000011 Chr1 GAF exon 686 1413 . - . ID=XXXX_000011-T1.exon1;Parent=XXXX_000011-T1; Chr1 GAF exon 466 612 . - . ID=XXXX_000011-T1.exon2;Parent=XXXX_000011-T1; Chr1 GAF exon 1 406 . - . ID=XXXX_000011-T1.exon3;Parent=XXXX_000011-T1; Chr1 GAF CDS 686 1413 . - 0 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1; Chr1 GAF CDS 466 612 . - 1 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1; Chr1 GAF CDS 1 406 . - 1 ID=XXXX_000011-T1.cds;Parent=XXXX_000011-T1;

jl02142 commented 6 months ago

When using the protein.faa from NCBI, OrthoFinder outputs the first bit of text from the header lines in the fasta file into N0.tsv, e.g. "WP_000002459.1" - the product accession. Regardless of the source and information in the N0.tsv file, the annotation used by OrthoRefine has to contain the same information to match against. OrthoRefine requires the product accession (Name from .gff) to match against to read in information, the locus tag (locus_tag) information for the analysis, and the gene information (product) to print to the outfile.

If this information is missing, an easy fix it to append the ID info as the locus_tag, product etc. This awk command should do that based on the example.

awk 'match($0,/ID=[^;]*;/) { print $0 "Name" substr($0, RSTART+2, RLENGTH-2) "product" substr($0, RSTART+2, RLENGTH-2) "gene" substr($0, RSTART+2, RLENGTH-2) "locus_tag" substr($0, RSTART+2, RLENGTH-2)}' $file

If the gff file is more than the example above or the format causes additional errors, you can post or email the file (or part of it using the bash head command) and I can write a bash command to change the file or change OrthoRefine depending on the complexity.

evan-john commented 6 months ago

That awk one-liner works well thanks.

One final formatting issue I managed to workaround is in the gff filenames. Originally I had multiple underscores. e.g. GCF_XX_ABCD.gff which caused some issues where the wrong file was being read if the substring 'XX' was matching. By editing to GCF_XXABCD.gff (and in corresponding input.txt & N0.tsv headers) seemed to solve the issue.

We are up and running now.

ZhaoyanZhong commented 6 months ago

Thank you very much for creating such great software. When I use a local GFF file as input and run the command "OrthoRefine.exe --input input.txt --OF_file N0.tsv --window_size 8 --synteny_ratio 0.5", the error message "terminate called after throwing an instance of 'std::length_error' what(): basic_string::_M_replace_aux Aborted (core dumped)“ appears.
How can I solve this?

jl02142 commented 6 months ago

This error is probably due to how OrthoRefine searches for matching file names based on the contents of the input file and the header line from N0.tsv. You can run the bash command below and upload the output file here or send the output to me in an email so I can take a look (the command will print all the current file names in the directory).

cat input.txt > upload_file && head -1 N0.tsv >> upload_file && ls >> upload_file

ZhaoyanZhong commented 6 months ago

Thank you so much. The following is the content in upload_file: GCF_Patellavulgata GCF_Physellaacuta HOG OG Gene Tree Parent Clade GCF_Patellavulgata GCF_Physellaacuta GCF_Patellavulgata_feature_table.txt GCF_Physellaacuta_feature_table.txt input.txt N0.tsv upload_file

jl02142 commented 6 months ago

Are the annotation files GFF or feature table?

ZhaoyanZhong commented 6 months ago

The annotation file is GFF. When I use GCF_Patellavulgata.gff, the error "Error feature table file missing" displayed. When I rename GCF_Patellavulgata.gff to GCF_Patellavulgata_feature_table.txt, there is no such error message, but "terminate called after throwing an instance of ' std::length_error' what(): basic_string::_M_replace_aux Aborted (core dumped)" appears.

jl02142 commented 6 months ago

Try using "genomic.gff" instead of ".gff" as the file extensions. For example, the E. coli's gff file that I use when testing is "GCF_000005845.2_ASM584v2_genomic.gff".

ZhaoyanZhong commented 6 months ago

When I use "GCF_Patellavulgata_genomic.gff", the error "Error feature table file missing" appears.

jl02142 commented 6 months ago

Can you verify that the OrthoRefine.cpp you are compiling from is the gff test branch and not the main? The gff OrthoRefine.cpp will have a copyright of 2024 listed on line 2 instead of 2023 and line 27 will say:

int file_type;                          // 0 = refseq, 1 = gff

When I use "GCF_Patellavulgata_genomic.gff" with the test branch code, I do not get an error message; however, I get the error message with the main code.

ZhaoyanZhong commented 6 months ago

I recompile the OrthoRefine.cpp from the gff test branch and still encounter the error "terminate called after throwing an instance of 'std::length_error' what(): basic_string::_M_replace_aux Aborted (core dumped)"

jl02142 commented 6 months ago

Use the head command to extract the first 30 lines of one of the gff files so I can check the format.

head -30 $file_genomic.gff

ZhaoyanZhong commented 6 months ago

The following is the content of the first 30 lines of the GFF file:

NC_065879.2 RefSeq region 1 NC_065879.2 Gnomon gene NC_065879.2 Gnomon lnc_RNA 83647 NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon gene NC_065879.2 Gnomon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon NC_065879.2 Gnomon gene NC_065879.2 Gnomon mRNA NC_065879.2 Gnomon exon NC_065879.2 Gnomon exon 95737551 . + . ID=NC_065879.2:1..95737551;Dbxref=taxon:6465;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA 83647 93123 . + . ID=gene-LOC126822048;Dbxref=GeneID:126822048;Name=LOC126822048;description=uncharacterized LOC126822048;gbkey=Gene;gene=LOC126822048;gene_biotype=lncRNA 93123 . + . ID=rna-XR_007680621.1;Parent=gene-LOC126822048;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;Name=XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 83647 83878 . + . ID=exon-XR_007680621.1-1;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 88949 89047 . + . ID=exon-XR_007680621.1-2;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 90047 90145 . + . ID=exon-XR_007680621.1-3;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 90308 90379 . + . ID=exon-XR_007680621.1-4;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 90650 90706 . + . ID=exon-XR_007680621.1-5;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 90968 91021 . + . ID=exon-XR_007680621.1-6;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 91259 91321 . + . ID=exon-XR_007680621.1-7;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 91938 92024 . + . ID=exon-XR_007680621.1-8;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 92349 92408 . + . ID=exon-XR_007680621.1-9;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 92657 92687 . + . ID=exon-XR_007680621.1-10;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 92846 93123 . + . ID=exon-XR_007680621.1-11;Parent=rna-XR_007680621.1;Dbxref=GeneID:126822048,GenBank:XR_007680621.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC126822048;product=uncharacterized LOC126822048;transcript_id=XR_007680621.1 130262 138460 . + . ID=gene-LOC130014432;Dbxref=GeneID:130014432;Name=LOC130014432;description=uncharacterized LOC130014432;gbkey=Gene;gene=LOC130014432;gene_biotype=lncRNA lnc_RNA 130262 138460 . + . ID=rna-XR_008789430.1;Parent=gene-LOC130014432;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;Name=XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 130262 130462 . + . ID=exon-XR_008789430.1-1;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 133617 133715 . + . ID=exon-XR_008789430.1-2;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 135433 135504 . + . ID=exon-XR_008789430.1-3;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 135758 135808 . + . ID=exon-XR_008789430.1-4;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 136229 136279 . + . ID=exon-XR_008789430.1-5;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 136516 136578 . + . ID=exon-XR_008789430.1-6;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 136824 136913 . + . ID=exon-XR_008789430.1-7;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 137307 137366 . + . ID=exon-XR_008789430.1-8;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 137977 138007 . + . ID=exon-XR_008789430.1-9;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 138183 138460 . + . ID=exon-XR_008789430.1-10;Parent=rna-XR_008789430.1;Dbxref=GeneID:130014432,GenBank:XR_008789430.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=ncRNA;gene=LOC130014432;product=uncharacterized LOC130014432;transcript_id=XR_008789430.1 173822 204032 . - . ID=gene-LOC126818000;Dbxref=GeneID:126818000;Name=LOC126818000;description=serine/threonine-protein kinase 25;gbkey=Gene;gene=LOC126818000;gene_biotype=protein_coding 173822 204032 . - . ID=rna-XM_050545848.1;Parent=gene-LOC126818000;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;Name=XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;model_evidence=Supporting evidence includes similarity to: 27 Proteins;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1 203736 204032 . - . ID=exon-XM_050545848.1-1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1 198056 198289 . - . ID=exon-XM_050545848.1-2;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XM_050545848.1;experiment=COORDINATES: polyA evidence [ECO:0006239];gbkey=mRNA;gene=LOC126818000;product=serine/threonine-protein kinase 25%2C transcript variant X9;transcript_id=XM_050545848.1

jl02142 commented 6 months ago

Does the gff file have header lines at the start of the file, lines that start with the "#"? If so, can you copy and paste at least one of those lines?

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build ASM584v2

!genome-build-accession NCBI_Assembly:GCF_000005845.2

sequence-region NC_000913.3 1 4641652

Does "CDS" ever appear in the 3rd column where "gene", "lnc_RNA", and "exon" appear? If so, can you copy and paste at least one of those lines?

NC_000913.3 RefSeq CDS 190 255 . + 0 ID=cds-NP_414542.1;Parent=gene-b0001;Dbxref=UniProtKB/Swiss-Prot:P0AD86,GenBank:NP_414542.1,ASAP:ABE-0000006,ECOCYC:EG11277,GeneID:944742;Name=NP_414542.1;gbkey=CDS;gene=thrL;locus_tag=b0001;orig_transcript_id=gnl|b0001|mrna.NP_414542;product=thr operon leader peptide;protein_id=NP_414542.1;transl_table=11

Can check with this command

cat GCF_Patellavulgata_genomic.gff | cut -f3 | grep -c "CDS"

It appears the file does not contain "locus_tag=" - which is not a problem. We should be able to use the "ID=gene-" part of the line, but let me know if "locus_tag=" is present or not in the gff.

NC_000913.3 RefSeq CDS 3720049 3720261 . + 0 ID=cds-NP_418012.1;Parent=gene-b3556;Dbxref=UniProtKB/Swiss-Prot:P0A9X9,GenBank:NP_418012.1,ASAP:ABE-0011616,ECOCYC:EG10166,GeneID:948070;Name=NP_418012.1;gbkey=CDS;gene=cspA;locus_tag=b3556;orig_transcript_id=gnl|b3556|mrna.NP_418012;product=cold shock protein CspA;protein_id=NP_418012.1;transl_table=11

Can check with this command

grep -m 10 "locus_tag=" GCF_Patellavulgata_genomic.gff

ZhaoyanZhong commented 6 months ago

Header lines that start with the "#" are:

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build xgPatVulg1.2

!genome-build-accession NCBI_Assembly:GCF_932274485.2

!annotation-date 05/12/2023

!annotation-source NCBI RefSeq GCF_932274485.2-RS_2023_05

sequence-region NC_065879.2 1 95737551

species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6465
"CDS" appear in the 3rd column: NC_065879.2 Gnomon CDS 203736 203756 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1 NC_065879.2 Gnomon CDS 198056 198289 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1 NC_065879.2 Gnomon CDS 195530 195695 . - 0 ID=cds-XP_050401805.1;Parent=rna-XM_050545848.1;Dbxref=GeneID:126818000,GenBank:XP_050401805.1;Name=XP _050401805.1;gbkey=CDS;gene=LOC126818000;product=serine/threonine-protein kinase 26 isoform X8;protein_id=XP_050401805.1
The GFF file does not contain "locus_tag=", but it contains "ID=gene-"

SamLlm commented 6 months ago

Hello,

I also have a problem with the error : ''Error feature table file missing". I tried to apply all the advices you provided here such as :

-add 'GCF_' prefix and 'genomic.gff' suffix on gff files (example for one of my file : GCF_Homo_sapiens.GRCh38.110_genomic.gff )

-use branch version ('2024 version' according to the second line of the code)

-my input text files is named 'input.txt' and contains gff files name with prefix removed, like this : 'GCF_Homo_sapiens.GRCh38.110'

Finally, that's how my command looks like : $myspace/scripts/all_Orthorefine/OrthoRefine_gff_compatible/orthorefine.exe --input input.txt --OF_file $myspace/data/Orthorefine/test/OrthoFinder/Results_mai30/Phylogenetic_Hierarchical_Orthogroups/N0.tsv --window_size 8 --synteny_ratio 0.5

Thanks in advance for your help.

jl02142 commented 6 months ago

@SamLlm If the data is publicly available (can be downloaded from NCBI or elsewhere), post the input.txt contents (the genome accessions) or send it to me in an email. If not, let me know and I'll find a way to safely share it. I was trying to troubleshoot this with gff files I obtained from NCBI, but have been unable to recreate the error.

@ZhaoyanZhong I believe I have a fix; however, I need to test it. I'm also working on editing the isoform removal script for gff files as the redundant information will cause OrthoRefine to work less accurately.

SamLlm commented 6 months ago

@jl02142 yes I would like to send it to you because it is not available on NCBI. Can you provide me an email I could send it to ?

jl02142 commented 5 months ago

jmludwig@uga.edu

jl02142 commented 5 months ago

As an update on the "Error feature table file missing",

The file names need a certain format for OrthoRefine to pattern match against. For example, if you are using human and mouse the input file would be:

GCF_Homo.sapiens
GCF_Mus.musculus

and the files in the directory would be named:

GCF_Homo.sapiens_GRCh38.pep.all.fa
GCF_Mus.musculus_GRCm39.pep.all.fa
GCF_Mus.musculus_GRCm39.112.genomic.gff
GCF_Homo.sapiens_GRCh38.112.genomic.gff

The pattern needs to be GCF**.genomic.gff where all the text before the second underscore needs to be the text in the input file.

jl02142 commented 5 months ago

I've pushed an update to OrthoRefine.cpp on the test branch that will resolve some issues.

If "locus_tag" is not present in GFF file, try "Name" and then try "protein_id" instead. Handle if some fields in the GFF have "cds:" which prevented pattern matching.

jl02142 / OrthoRefine

Use of GFF format as input? #1

Example gff from FunAnnotate funannotate util gff-rename

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build ASM584v2

!genome-build-accession NCBI_Assembly:GCF_000005845.2

sequence-region NC_000913.3 1 4641652

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build xgPatVulg1.2

!genome-build-accession NCBI_Assembly:GCF_932274485.2

!annotation-date 05/12/2023

!annotation-source NCBI RefSeq GCF_932274485.2-RS_2023_05

sequence-region NC_065879.2 1 95737551

species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6465