Closed rac266 closed 1 year ago
I'm not really sure whats going on here. I'd use parse_annotations(..., troubleshoot = T)
to see whats going on. My guess is your parameters are not right.
parse_annotations
is a convenience function ... if it isn't working for you, you should feel free to just parse the gff to bed files in any way you choose. Just store the new parsed files in the /peptide and /bed directories in your GENESPACE working directory.
Okay, I generated my own bed file and stored them in the bed directory:
Example of one bedfile:
1 0 134019835 region:1 1 45362 45561 . 1 45362 45561 . 1 45362 45933 . 1 45362 45933 . 1 45362 55328 transcript:ENSAMXT00025000532 1 45362 56004 gene:ENSAMXG00025000236 1 45362 56004 transcript:ENSAMXT00025000534 1 45363 45561 .
However, when running genespace it ran into the following error:
library(GENESPACE) gpar <- init_genespace(
- wd = "/path/to/workingDir",
- path2mcscanx = /path/to/MCScanX") Checking Working Directory ... PASS:
/path/to/workingDir
Checking user-defined parameters ... Genome IDs & ploidy ... species1_AstMex3_surface_GCA_023375975.1: 1 species2_AstMex3_Molino_GCA_023375845.1 : 1 species3_AstMex3_Tinaja_GCA_023375835.1 : 1 species4_AMEX_1.1_GCA_019721115.1 : 1 Outgroup ... NONE n. parallel processes ... 14 collinear block size ... 5 collinear block search radius ... 25 n gaps in collinear block ... 5 synteny buffer size... 100 only orthogroups hits as anchors ... TRUE n secondary hits ... 0 Checking annotation files (.bed and peptide .fa): species1_AstMex3_surface_GCA_023375975.1: 0 / 302884 geneIDs exactly match (FAIL) Error in FUN(X[[i]], ...) : some genes have ':' in the names. This string cannot be in gene names as it is used as the dictionary separator by OrthoFinder
Is this related to the lack of matches made above when trying to parse the gff3 file since none of the geneIDs are matching?
your error is species1_AstMex3_surface_GCA_023375975.1: 0 / 302884 geneIDs exactly match (FAIL)
... it looks like you have way too many CDSs (302k!) and that none of those have the id column (4th) match exactly the peptide fasta file headers. Maybe you didn't parse the fa headers?
Ah yes you're right - I only parsed the bed files. Below are what the start of my files look like now:
head species1_AstMex3_surface_GCA_023375975.1.bed 1 45362 45561 . 1 45362 45561 . 1 45362 45933 . 1 45362 45933 . 1 45362 55328 transcript:ENSAMXT00025000532 1 45362 56004 gene:ENSAMXG00025000236 1 45362 56004 transcript:ENSAMXT00025000534 1 45363 45561 . 1 45363 45933 . 1 45363 56004 transcript:ENSAMXT00025000533
head species1_AstMex3_surface_GCA_023375975.1.fa
ENSAMXP00025000188.1 CVRHDSHQVFGAFSSDPFRVSNSCYGTGETFLFTFDPEFKTFCWSGENSYFVRGFLDSLQ MGGGGGGPFGLWLDADLCRGSTFSCNTFRNQPLSPLQDFTIRAVEVWTFL ENSAMXP00025000190.1 MRPILQNIQILYVSNGGEEPYVEIITVRKSARSRSVCSSRSVCSSRSVCSSRSARSVCSS WSECTSEDEEAEDTLPVLNQKSSILQEHHIHSLAGGVPARVQGTVWELVYSTAEHGTSLR TLYRKTAPIDRPVLLLIQDIHNQVSQCISAYITV ENSAMXP00025000193.1 MVIRSEVLVLVLLLGSAVRSFKLDELIKYVDSHEEEYVQALREWVAVESDSSDVTKREEL HRMMDMVAEKLRKIKGTVELVDIGSQQLPDGSSLALPKVVVAQFGSDTNKATVCVYGHVD
Both files seem to match the parsed format on the documentation page but it is still throwing a matching error:
Error in check_annotFiles(filepath = wd, genomeIDs = gids) : some annotations do not have matching peptide headers and bed names
It looks like you haven't fully parsed the headers or the id columns. id has gene:
preceding the gene IDs. The peptide fasta headers have .1
after the names.
I'm going to close this, but feel free to re-open if this didn't solve your issue. John
Hello -
I am having issues with running parsed_annotations on my 4 ensembl pep.fa and gff3 files. Any help or advice is appreciated. Thanks for creating this package!
The bed and pep directories with the subsequent files are made, however, the files themselves are empty. I also tried following a preview ensembl parsed file issue and still had the same output as above:
Below are the examples of my file headers in case there are any glaring issues there.
Example header of 1 pep.fa file:
example header of corresponding gff3 file:
sequence-region JALAHU010000101.1 1 26141
sequence-region JALAHU010000102.1 1 25757
sequence-region JALAHU010000103.1 1 24579
sequence-region JALAHU010000104.1 1 20700
!genome-build AstMex3_surface
!genome-version AstMex3_surface
!genome-date 2022-05
!genome-build-accession GCA_023375975.1
!genebuild-last-updated 2022-07
1 AstMex3_surface region 1 134019835 . . . ID=region:1;Alias=CM041909.1
1 ensembl gene 45363 56004 . + . ID=gene:ENSAMXG00025000236;Name=SLC25A35;biotype=protein_coding;description=solute carrier family 25 member 35 [Ensembl NN prediction with score 100%25];gene_id=ENSAMXG00025000236;version=1 1 ensembl mRNA 45363 55328 . + . ID=transcript:ENSAMXT00025000532;Parent=gene:ENSAMXG00025000236;biotype=protein_coding;transcript_id=ENSAMXT00025000532;version=1 1 ensembl five_prime_UTR 45363 45561 . + . Parent=transcript:ENSAMXT00025000532 1 ensembl exon 45363 45933 . + . Parent=transcript:ENSAMXT00025000532;constitutive=0;exon_id=ENSAMXE00025002451;rank=1;version=1 1 ensembl CDS 45562 45933 . + 0 ID=CDS:ENSAMXP00025000404;Parent=transcript:ENSAMXT00025000532;protein_id=ENSAMXP00025000404;version=1 1 ensembl exon 52981 53133 . + . Parent=transcript:ENSAMXT00025000532;constitutive=1;exon_id=ENSAMXE00025002452;rank=2;version=1 1 ensembl CDS 52981 53133 . + 0 ID=CDS:ENSAMXP00025000404;Parent=transcript:ENSAMXT00025000532;protein_id=ENSAMXP00025000404;version=1 1 ensembl exon 54101 54235 . + . Parent=transcript:ENSAMXT00025000532;constitutive=1;exon_id=ENSAMXE00025002453;rank=3;version=