legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

arahy.Tifrunner.gnm2.ann2.PVFB.gene_models_main.gff3.gz is funky - different sequences from assembly? #174

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago

The genome assembly uses

>arahy.Tifrunner.gnm2.Arahy.01
...
>arahy.Tifrunner.gnm2.scaffold_821
...

while the ann2 GFF uses

arahy.Tifrunner.gnm2.chr01
...
arahy.Tifrunner.gnm2.scaffold_108
...

Shall I just rename the sequences in the GFF?

sammyjava commented 1 year ago

Oh, actually it's worse - there is no scaffold_108 in the genome assembly. Something is amiss with the ann2 annotation for gnm2.
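A quick way to confirm this kind of mismatch is to list every sequence ID that the GFF uses but the assembly does not define. The following is a minimal sketch, not a datastore tool: the function name `gff_ids_missing_from_genome` is hypothetical, and it assumes both files are gzipped and that FASTA IDs end at the first whitespace in the header line.

```shell
# Print sequence IDs that appear in column 1 of a gzipped GFF3
# but have no matching header in a gzipped FASTA file.
# Usage: gff_ids_missing_from_genome annotation.gff3.gz genome.fna.gz
gff_ids_missing_from_genome() (
  gff=$1
  fasta=$2
  # Use one collation order for both sort and comm.
  LC_ALL=C
  export LC_ALL
  tmpdir=$(mktemp -d) || exit 1
  # Sequence IDs used by the annotation (skip comment and directive lines).
  gzip -dc "$gff" | grep -v '^#' | cut -f1 | sort -u > "$tmpdir/gff_ids"
  # Sequence IDs defined by the assembly (header text up to the first space).
  gzip -dc "$fasta" | grep '^>' |
    sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u > "$tmpdir/fasta_ids"
  # comm -23 keeps lines that occur only in the first file.
  comm -23 "$tmpdir/gff_ids" "$tmpdir/fasta_ids"
  rm -rf "$tmpdir"
)
```

Empty output means every ID in the GFF exists in the assembly. Any line printed is an ID the assembly does not have, in this local copy at least.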

adf-ncgr commented 1 year ago

are you sure your files are up to date wrt what's in the datastore? /usr/local/www/data/v2/Arachis/hypogaea/genomes/Tifrunner.gnm2.J5K5/arahy.Tifrunner.gnm2.J5K5.genome_main.fna.gz.fai has the arahy.Tifrunner.gnm2.chr01 naming (which I believe @StevenCannon-USDA announced he was going to impose on it some time ago).

sammyjava commented 1 year ago

(Fortunately, the newish chromosome_prefix and supercontig_prefix entries in the genome README lead to an error being thrown on this GFF load.)
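The same prefix check can be run by hand before a load. This is only a sketch under stated assumptions: it assumes the README stores the entries as plain unquoted `key: value` lines, and that each sequence ID is the collection name, a dot, then the prefix. The helper names `readme_value` and `gff_ids_with_unexpected_prefix` are hypothetical, not part of any loader.

```shell
# Read a simple unquoted "key: value" entry from a README (assumed layout).
# Usage: readme_value README_FILE KEY
readme_value() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | sed 's/[[:space:]]*$//' | head -n 1
}

# Print GFF sequence IDs that start with neither
# <collection>.<chromosome_prefix> nor <collection>.<supercontig_prefix>.
# Usage: gff_ids_with_unexpected_prefix GFF_GZ COLLECTION CHR_PREFIX SC_PREFIX
gff_ids_with_unexpected_prefix() (
  gff=$1
  collection=$2
  chr_prefix=$3
  sc_prefix=$4
  gzip -dc "$gff" | grep -v '^#' | cut -f1 | LC_ALL=C sort -u |
    awk -v c="$collection.$chr_prefix" -v s="$collection.$sc_prefix" \
      'index($0, c) != 1 && index($0, s) != 1'
)
```

A possible invocation would read both prefixes with `readme_value` and pass them along with the collection name (for example `arahy.Tifrunner.gnm2`). Any ID printed would not pass the prefix check.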

sammyjava commented 1 year ago

I was up to date. I'll hit @StevenCannon-USDA on this one; he forgot to update this annotation and now must pay the price. Apparently the scaffolds got named differently in the genome assembly.

[shokin@peanutbase-stage ~/v2/Arachis/hypogaea/annotations/Tifrunner.gnm2.ann2.PVFB]$ zcat arahy.Tifrunner.gnm2.ann2.PVFB.gene_models_main.gff3.gz | cut -f1 | uniq
##gff-version 3
arahy.Tifrunner.gnm2.chr01
arahy.Tifrunner.gnm2.chr02
arahy.Tifrunner.gnm2.chr03
arahy.Tifrunner.gnm2.chr04
arahy.Tifrunner.gnm2.chr05
arahy.Tifrunner.gnm2.chr06
arahy.Tifrunner.gnm2.chr07
arahy.Tifrunner.gnm2.chr08
arahy.Tifrunner.gnm2.chr09
arahy.Tifrunner.gnm2.chr10
arahy.Tifrunner.gnm2.chr11
arahy.Tifrunner.gnm2.chr12
arahy.Tifrunner.gnm2.chr13
arahy.Tifrunner.gnm2.chr14
arahy.Tifrunner.gnm2.chr15
arahy.Tifrunner.gnm2.chr16
arahy.Tifrunner.gnm2.chr17
arahy.Tifrunner.gnm2.chr18
arahy.Tifrunner.gnm2.chr19
arahy.Tifrunner.gnm2.chr20
arahy.Tifrunner.gnm2.scaffold_108
arahy.Tifrunner.gnm2.scaffold_110
arahy.Tifrunner.gnm2.scaffold_113
arahy.Tifrunner.gnm2.scaffold_115
arahy.Tifrunner.gnm2.scaffold_116
arahy.Tifrunner.gnm2.scaffold_120
arahy.Tifrunner.gnm2.scaffold_131
arahy.Tifrunner.gnm2.scaffold_132
arahy.Tifrunner.gnm2.scaffold_134
arahy.Tifrunner.gnm2.scaffold_136
arahy.Tifrunner.gnm2.scaffold_139
arahy.Tifrunner.gnm2.scaffold_142
arahy.Tifrunner.gnm2.scaffold_146
arahy.Tifrunner.gnm2.scaffold_148
arahy.Tifrunner.gnm2.scaffold_154
arahy.Tifrunner.gnm2.scaffold_168
arahy.Tifrunner.gnm2.scaffold_170
arahy.Tifrunner.gnm2.scaffold_172
arahy.Tifrunner.gnm2.scaffold_174
arahy.Tifrunner.gnm2.scaffold_175
arahy.Tifrunner.gnm2.scaffold_184
arahy.Tifrunner.gnm2.scaffold_193
arahy.Tifrunner.gnm2.scaffold_194
arahy.Tifrunner.gnm2.scaffold_197
arahy.Tifrunner.gnm2.scaffold_204
arahy.Tifrunner.gnm2.scaffold_21
arahy.Tifrunner.gnm2.scaffold_211
arahy.Tifrunner.gnm2.scaffold_214
arahy.Tifrunner.gnm2.scaffold_216
arahy.Tifrunner.gnm2.scaffold_22
arahy.Tifrunner.gnm2.scaffold_222
arahy.Tifrunner.gnm2.scaffold_228
arahy.Tifrunner.gnm2.scaffold_23
arahy.Tifrunner.gnm2.scaffold_231
arahy.Tifrunner.gnm2.scaffold_234
arahy.Tifrunner.gnm2.scaffold_239
arahy.Tifrunner.gnm2.scaffold_240
arahy.Tifrunner.gnm2.scaffold_25
arahy.Tifrunner.gnm2.scaffold_268
arahy.Tifrunner.gnm2.scaffold_27
arahy.Tifrunner.gnm2.scaffold_275
arahy.Tifrunner.gnm2.scaffold_285
arahy.Tifrunner.gnm2.scaffold_288
arahy.Tifrunner.gnm2.scaffold_291
arahy.Tifrunner.gnm2.scaffold_293
arahy.Tifrunner.gnm2.scaffold_297
arahy.Tifrunner.gnm2.scaffold_301
arahy.Tifrunner.gnm2.scaffold_303
arahy.Tifrunner.gnm2.scaffold_305
arahy.Tifrunner.gnm2.scaffold_306
arahy.Tifrunner.gnm2.scaffold_307
arahy.Tifrunner.gnm2.scaffold_31
arahy.Tifrunner.gnm2.scaffold_311
arahy.Tifrunner.gnm2.scaffold_312
arahy.Tifrunner.gnm2.scaffold_318
arahy.Tifrunner.gnm2.scaffold_32
arahy.Tifrunner.gnm2.scaffold_327
arahy.Tifrunner.gnm2.scaffold_33
arahy.Tifrunner.gnm2.scaffold_330
arahy.Tifrunner.gnm2.scaffold_343
arahy.Tifrunner.gnm2.scaffold_350
arahy.Tifrunner.gnm2.scaffold_352
arahy.Tifrunner.gnm2.scaffold_36
arahy.Tifrunner.gnm2.scaffold_362
arahy.Tifrunner.gnm2.scaffold_364
arahy.Tifrunner.gnm2.scaffold_386
arahy.Tifrunner.gnm2.scaffold_42
arahy.Tifrunner.gnm2.scaffold_43
arahy.Tifrunner.gnm2.scaffold_44
arahy.Tifrunner.gnm2.scaffold_446
arahy.Tifrunner.gnm2.scaffold_45
arahy.Tifrunner.gnm2.scaffold_47
arahy.Tifrunner.gnm2.scaffold_477
arahy.Tifrunner.gnm2.scaffold_49
arahy.Tifrunner.gnm2.scaffold_496
arahy.Tifrunner.gnm2.scaffold_498
arahy.Tifrunner.gnm2.scaffold_50
arahy.Tifrunner.gnm2.scaffold_502
arahy.Tifrunner.gnm2.scaffold_504
arahy.Tifrunner.gnm2.scaffold_506
arahy.Tifrunner.gnm2.scaffold_51
arahy.Tifrunner.gnm2.scaffold_528
arahy.Tifrunner.gnm2.scaffold_530
arahy.Tifrunner.gnm2.scaffold_54
arahy.Tifrunner.gnm2.scaffold_55
arahy.Tifrunner.gnm2.scaffold_557
arahy.Tifrunner.gnm2.scaffold_576
arahy.Tifrunner.gnm2.scaffold_579
arahy.Tifrunner.gnm2.scaffold_59
arahy.Tifrunner.gnm2.scaffold_602
arahy.Tifrunner.gnm2.scaffold_610
arahy.Tifrunner.gnm2.scaffold_614
arahy.Tifrunner.gnm2.scaffold_62
arahy.Tifrunner.gnm2.scaffold_63
arahy.Tifrunner.gnm2.scaffold_66
arahy.Tifrunner.gnm2.scaffold_669
arahy.Tifrunner.gnm2.scaffold_68
arahy.Tifrunner.gnm2.scaffold_69
arahy.Tifrunner.gnm2.scaffold_703
arahy.Tifrunner.gnm2.scaffold_76
arahy.Tifrunner.gnm2.scaffold_77
arahy.Tifrunner.gnm2.scaffold_78
arahy.Tifrunner.gnm2.scaffold_797
arahy.Tifrunner.gnm2.scaffold_83
arahy.Tifrunner.gnm2.scaffold_830
arahy.Tifrunner.gnm2.scaffold_84
arahy.Tifrunner.gnm2.scaffold_853
arahy.Tifrunner.gnm2.scaffold_856
arahy.Tifrunner.gnm2.scaffold_87
arahy.Tifrunner.gnm2.scaffold_872
arahy.Tifrunner.gnm2.scaffold_878
arahy.Tifrunner.gnm2.scaffold_91
arahy.Tifrunner.gnm2.scaffold_92
arahy.Tifrunner.gnm2.scaffold_93
arahy.Tifrunner.gnm2.scaffold_95
arahy.Tifrunner.gnm2.scaffold_98
arahy.Tifrunner.gnm2.scaffold_99

sammyjava commented 1 year ago

Quoting @adf-ncgr: "are you sure your files are up to date wrt what's in the datastore? /usr/local/www/data/v2/Arachis/hypogaea/genomes/Tifrunner.gnm2.J5K5/arahy.Tifrunner.gnm2.J5K5.genome_main.fna.gz.fai has the arahy.Tifrunner.gnm2.chr01 naming (which I believe @StevenCannon-USDA announced he was going to impose on it some time ago)."

Oh, I didn't read what you said carefully. The genome collection has Arahy.01 etc., so if the annotation was updated without updating the genome, that's even worse. In any case, the genomes define the chromosome names and the annotations must comply. This one doesn't; it's the only one that doesn't use Arahy.01 etc.

adf-ncgr commented 1 year ago

I'm still not following you. I see:

zgrep '^>' /usr/local/www/data/v2/Arachis/hypogaea/genomes/Tifrunner.gnm2.J5K5/arahy.Tifrunner.gnm2.J5K5.genome_main.fna.gz | head -10
>arahy.Tifrunner.gnm2.chr01
>arahy.Tifrunner.gnm2.chr02
>arahy.Tifrunner.gnm2.chr03
>arahy.Tifrunner.gnm2.chr04
>arahy.Tifrunner.gnm2.chr05
>arahy.Tifrunner.gnm2.chr06
>arahy.Tifrunner.gnm2.chr07
>arahy.Tifrunner.gnm2.chr08
>arahy.Tifrunner.gnm2.chr09
>arahy.Tifrunner.gnm2.chr10
...
zgrep -v '^#' /usr/local/www/data/v2/Arachis/hypogaea/annotations/Tifrunner.gnm2.ann2.PVFB/arahy.Tifrunner.gnm2.ann2.PVFB.gene_models_main.gff3.gz | awk '{print $1}' | uniq | head -10
arahy.Tifrunner.gnm2.chr01
arahy.Tifrunner.gnm2.chr02
arahy.Tifrunner.gnm2.chr03
arahy.Tifrunner.gnm2.chr04
arahy.Tifrunner.gnm2.chr05
arahy.Tifrunner.gnm2.chr06
arahy.Tifrunner.gnm2.chr07
arahy.Tifrunner.gnm2.chr08
arahy.Tifrunner.gnm2.chr09
arahy.Tifrunner.gnm2.chr10

which seems at least consistent and what I think @StevenCannon-USDA intended for the new naming convention. What am I missing?

sammyjava commented 1 year ago

Nothing. I was behind on the GENOME assembly, not the annotation. Never mind. Starting over with legumemine....
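Since the root cause was a stale local copy, a check like the one below could flag it early. It is a rough sketch that assumes the local working copy and a reference copy of the datastore directory are both reachable as ordinary directories (for example over a mount). The function name `stale_files` is hypothetical.

```shell
# Compare a local directory against a reference directory and report
# files that are missing locally or whose content differs.
# Usage: stale_files LOCAL_DIR REFERENCE_DIR
stale_files() (
  local_dir=$1
  ref_dir=$2
  for ref in "$ref_dir"/*; do
    # Skip subdirectories and the unexpanded glob of an empty directory.
    [ -f "$ref" ] || continue
    name=$(basename "$ref")
    if [ ! -f "$local_dir/$name" ]; then
      echo "missing: $name"
    elif ! cmp -s "$ref" "$local_dir/$name"; then
      # cmp -s is silent and returns non-zero when the bytes differ.
      echo "differs: $name"
    fi
  done
)
```

Running it on the genome collection directory before loading would list any file, such as the `.fai` index, that no longer matches the reference copy.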

sammyjava commented 1 year ago

Actually I can blow away the Arachis chromosomes and supercontigs without a rebuild, I think.

sammyjava commented 1 year ago

And since the README has the up-to-date prefixes (one hopes), I should have taken that as a clue for my out-of-date files in the first place. That's actually a nice thing to have in the repo for catching this sort of thing if I'd had my head on straight.