rac266 commented 1 year ago

Hello -

I am having issues with running parsed_annotations on my 4 ensembl pep.fa and gff3 files. Any help or advice is appreciated. Thanks for creating this package!

library(GENESPACE) parsedPaths <- parse_annotations(

rawGenomeRepo = "/path/to/genomeRepo",

genomeDirs = c("species1_AstMex3_surface_GCA_023375975.1", "species2_AstMex3_Molino_GCA_023375845.1", "species3_AstMex3_Tinaja_GCA_023375835.1", "species4_AMEX_1.1_GCA_019721115.1"),

genomeIDs = c("species1_AstMex3_surface_GCA_023375975.1", "species2_AstMex3_Molino_GCA_023375845.1", "species3_AstMex3_Tinaja_GCA_023375835.1", "species4_AMEX_1.1_GCA_019721115.1"),

gffString = "gff3",

faString = "fa",

presets = "ncbi",

genespaceWd = "/path/to/workingDir") species1_AstMex3_surface_GCA_023375975.1: n unique sequences = 1, n matched to gff = 0 species2_AstMex3_Molino_GCA_023375845.1: n unique sequences = 1, n matched to gff = 0 species3_AstMex3_Tinaja_GCA_023375835.1: n unique sequences = 1, n matched to gff = 0 species4_AMEX_1.1_GCA_019721115.1: n unique sequences = 1, n matched to gff = 0

The bed and pep directories with the subsequent files are made, however, the files themselves are empty. I also tried following a preview ensembl parsed file issue and still had the same output as above:

library(GENESPACE) parsedPaths <- parse_annotations(

rawGenomeRepo = "/path/to/genomeRepo",

genomeDirs = c("species1_AstMex3_surface_GCA_023375975.1", "species2_AstMex3_Molino_GCA_023375845.1", "species3_AstMex3_Tinaja_GCA_023375835.1", "species4_AMEX_1.1_GCA_019721115.1"),

genomeIDs = c("species1_AstMex3_surface_GCA_023375975.1", "species2_AstMex3_Molino_GCA_023375845.1", "species3_AstMex3_Tinaja_GCA_023375835.1", "species4_AMEX_1.1_GCA_019721115.1"),

headerStripText = "gene:", gffIdColumn = "ID", gffStripText = "gene:", genespaceWd = "/path/to/workingDir") species1_AstMex3_surface_GCA_023375975.1: n unique sequences = 27065, n matched to gff = 0 species2_AstMex3_Molino_GCA_023375845.1: n unique sequences = 26293, n matched to gff = 0 species3_AstMex3_Tinaja_GCA_023375835.1: n unique sequences = 27295, n matched to gff = 0 species4_AMEX_1.1_GCA_019721115.1: n unique sequences = 27244, n matched to gff = 0

Below are the examples of my file headers in case there are any glaring issues there.

Example header of 1 pep.fa file:

ENSAMXP00025000188.1 pep AstMex3_surface:1:130817028:130818689:-1 gene:ENSAMXG00025000109.1 transcript:ENSAMXT00025000261.1 gene_biotype:protein_coding transcript_biotype:protein_coding CVRHDSHQVFGAFSSDPFRVSNSCYGTGETFLFTFDPEFKTFCWSGENSYFVRGFLDSLQ MGGGGGGPFGLWLDADLCRGSTFSCNTFRNQPLSPLQDFTIRAVEVWTFL ENSAMXP00025000190.1 pep AstMex3_surface:1:130819784:130821904:-1 gene:ENSAMXG00025000110.1 transcript:ENSAMXT00025000263.1 gene_biotype:protein_coding transcript_biotype:protein_coding

example header of corresponding gff3 file:

sequence-region JALAHU010000101.1 1 26141

sequence-region JALAHU010000102.1 1 25757

sequence-region JALAHU010000103.1 1 24579

sequence-region JALAHU010000104.1 1 20700

!genome-build AstMex3_surface

!genome-version AstMex3_surface

!genome-date 2022-05

!genome-build-accession GCA_023375975.1

!genebuild-last-updated 2022-07

1 AstMex3_surface region 1 134019835 . . . ID=region:1;Alias=CM041909.1

1 ensembl gene 45363 56004 . + . ID=gene:ENSAMXG00025000236;Name=SLC25A35;biotype=protein_coding;description=solute carrier family 25 member 35 [Ensembl NN prediction with score 100%25];gene_id=ENSAMXG00025000236;version=1 1 ensembl mRNA 45363 55328 . + . ID=transcript:ENSAMXT00025000532;Parent=gene:ENSAMXG00025000236;biotype=protein_coding;transcript_id=ENSAMXT00025000532;version=1 1 ensembl five_prime_UTR 45363 45561 . + . Parent=transcript:ENSAMXT00025000532 1 ensembl exon 45363 45933 . + . Parent=transcript:ENSAMXT00025000532;constitutive=0;exon_id=ENSAMXE00025002451;rank=1;version=1 1 ensembl CDS 45562 45933 . + 0 ID=CDS:ENSAMXP00025000404;Parent=transcript:ENSAMXT00025000532;protein_id=ENSAMXP00025000404;version=1 1 ensembl exon 52981 53133 . + . Parent=transcript:ENSAMXT00025000532;constitutive=1;exon_id=ENSAMXE00025002452;rank=2;version=1 1 ensembl CDS 52981 53133 . + 0 ID=CDS:ENSAMXP00025000404;Parent=transcript:ENSAMXT00025000532;protein_id=ENSAMXP00025000404;version=1 1 ensembl exon 54101 54235 . + . Parent=transcript:ENSAMXT00025000532;constitutive=1;exon_id=ENSAMXE00025002453;rank=3;version=

jtlovell commented 1 year ago

I'm not really sure whats going on here. I'd use parse_annotations(..., troubleshoot = T) to see whats going on. My guess is your parameters are not right. parse_annotations is a convenience function ... if it isn't working for you, you should feel free to just parse the gff to bed files in any way you choose. Just store the new parsed files in the /peptide and /bed directories in your GENESPACE working directory.

rac266 commented 1 year ago

Okay, I generated my own bed file and stored them in the bed directory:

Example of one bedfile:

1 0 134019835 region:1 1 45362 45561 . 1 45362 45561 . 1 45362 45933 . 1 45362 45933 . 1 45362 55328 transcript:ENSAMXT00025000532 1 45362 56004 gene:ENSAMXG00025000236 1 45362 56004 transcript:ENSAMXT00025000534 1 45363 45561 .

However, when running genespace it ran into the following error:

library(GENESPACE) gpar <- init_genespace(

wd = "/path/to/workingDir",

path2mcscanx = /path/to/MCScanX") Checking Working Directory ... PASS: /path/to/workingDir Checking user-defined parameters ... Genome IDs & ploidy ... species1_AstMex3_surface_GCA_023375975.1: 1 species2_AstMex3_Molino_GCA_023375845.1 : 1 species3_AstMex3_Tinaja_GCA_023375835.1 : 1 species4_AMEX_1.1_GCA_019721115.1 : 1 Outgroup ... NONE n. parallel processes ... 14 collinear block size ... 5 collinear block search radius ... 25 n gaps in collinear block ... 5 synteny buffer size... 100 only orthogroups hits as anchors ... TRUE n secondary hits ... 0 Checking annotation files (.bed and peptide .fa): species1_AstMex3_surface_GCA_023375975.1: 0 / 302884 geneIDs exactly match (FAIL) Error in FUN(X[[i]], ...) : some genes have ':' in the names. This string cannot be in gene names as it is used as the dictionary separator by OrthoFinder

Is this related to the lack of matches made above when trying to parse the gff3 file since none of the geneIDs are matching?

jtlovell commented 1 year ago

your error is species1_AstMex3_surface_GCA_023375975.1: 0 / 302884 geneIDs exactly match (FAIL) ... it looks like you have way too many CDSs (302k!) and that none of those have the id column (4th) match exactly the peptide fasta file headers. Maybe you didn't parse the fa headers?

rac266 commented 1 year ago

Ah yes you're right - I only parsed the bed files. Below are what the start of my files look like now:

head species1_AstMex3_surface_GCA_023375975.1.bed 1 45362 45561 . 1 45362 45561 . 1 45362 45933 . 1 45362 45933 . 1 45362 55328 transcript:ENSAMXT00025000532 1 45362 56004 gene:ENSAMXG00025000236 1 45362 56004 transcript:ENSAMXT00025000534 1 45363 45561 . 1 45363 45933 . 1 45363 56004 transcript:ENSAMXT00025000533

head species1_AstMex3_surface_GCA_023375975.1.fa

ENSAMXP00025000188.1 CVRHDSHQVFGAFSSDPFRVSNSCYGTGETFLFTFDPEFKTFCWSGENSYFVRGFLDSLQ MGGGGGGPFGLWLDADLCRGSTFSCNTFRNQPLSPLQDFTIRAVEVWTFL ENSAMXP00025000190.1 MRPILQNIQILYVSNGGEEPYVEIITVRKSARSRSVCSSRSVCSSRSVCSSRSARSVCSS WSECTSEDEEAEDTLPVLNQKSSILQEHHIHSLAGGVPARVQGTVWELVYSTAEHGTSLR TLYRKTAPIDRPVLLLIQDIHNQVSQCISAYITV ENSAMXP00025000193.1 MVIRSEVLVLVLLLGSAVRSFKLDELIKYVDSHEEEYVQALREWVAVESDSSDVTKREEL HRMMDMVAEKLRKIKGTVELVDIGSQQLPDGSSLALPKVVVAQFGSDTNKATVCVYGHVD

Both files seem to match the parsed format on the documentation page but it is still throwing a matching error:

Error in check_annotFiles(filepath = wd, genomeIDs = gids) : some annotations do not have matching peptide headers and bed names

jtlovell commented 1 year ago

It looks like you haven't fully parsed the headers or the id columns. id has gene: preceding the gene IDs. The peptide fasta headers have .1 after the names.

jtlovell commented 1 year ago

I'm going to close this, but feel free to re-open if this didn't solve your issue. John

jtlovell / GENESPACE

parsed annotations for ensembl files #79

sequence-region JALAHU010000101.1 1 26141

sequence-region JALAHU010000102.1 1 25757

sequence-region JALAHU010000103.1 1 24579

sequence-region JALAHU010000104.1 1 20700

!genome-build AstMex3_surface

!genome-version AstMex3_surface

!genome-date 2022-05

!genome-build-accession GCA_023375975.1

!genebuild-last-updated 2022-07