Closed nrockweiler closed 1 year ago
The UCSC browser GFF3 parser can't parse this either; it is invalid.
Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1.
@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and the GenomeTools gff3validator.
Let us know in case there are any other issues!
Best, Arang
Wonderful! Thank you so much.
On Thu, Jul 6, 2023, 2:05 PM Arang Rhie wrote:
Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz
@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and the GenomeTools gff3validator.
Let us know in case there are any other issues!
Best, Arang
Hi,
I believe there are a fair number of GFF3 validation issues with the recent update of JHU RefSeqv110 + Liftoff v5 in commit dafcf67.
I've been using the GenomeTools gff3validator tool to find these issues. Below is a summary of the issues:
1115 records have an odd key in the attributes column, e.g., MIR3690_1 is a PAR gene and is on both chrX and chrY. To follow the convention for other PAR genes, I think the copy on chrX should be renamed MIR3690.
There is more than 1 ID element on line 3999636 (the IDs are NM_001320962.1 and TSPY10P).
80 records have a malformed key-value pair in the attributes column; the "key" is called IDNM* and there is no value. I think this is supposed to be ID=NM*.
While it didn't come up as a validation issue, I saw a lot of text where I expected ASCII characters but that looked like hex encodings, e.g., GeneID:100289087%2C, testis specific protein Y-linked 10%252C transcript variant, etc. Maybe this has something to do with the mention of "correct[ing the] special character issues from the original file" in the README?
Thank you! Nicole
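For anyone who wants to spot-check a copy of the file without GenomeTools, here is a minimal Python sketch of the kinds of checks described above. It is not how gff3validator or the UCSC parser work internally. The names `check_attributes` and `fully_unquote` are hypothetical helpers, and the input is assumed to be the ninth (attributes) column of a single GFF3 record. Note that one level of percent-encoding such as `%2C` is expected in GFF3, because commas in attribute values must be escaped; a sequence like `%252C` suggests the value was encoded twice.

```python
import re
from urllib.parse import unquote

# Matches an escaped percent sign followed by two hex digits, e.g. "%252C",
# which is what "%2C" looks like after being percent-encoded a second time.
DOUBLE_ENCODED = re.compile(r"%25[0-9A-Fa-f]{2}")


def check_attributes(attr_column):
    """Return a list of problems found in one GFF3 attributes column.

    Hypothetical helper: it covers only the three symptoms from this report
    (key with no value, repeated ID, double-encoded text), not the full spec.
    """
    problems = []
    id_count = 0
    for pair in attr_column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            # e.g. "IDNM_..." where "ID=NM_..." was probably intended
            problems.append(f"malformed pair: {pair!r}")
            continue
        if key == "ID":
            id_count += 1
        if DOUBLE_ENCODED.search(value):
            problems.append(f"possibly double-encoded value for {key!r}")
    if id_count > 1:
        problems.append(f"{id_count} ID attributes")
    return problems


def fully_unquote(text, max_rounds=5):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text
```

Calling `check_attributes` on column 9 of each non-comment line and printing the line number for any non-empty result should give a rough list to compare against the validator output. `fully_unquote` is only meant for eyeballing what a doubly encoded description was supposed to say; a corrected file would still need a single level of encoding for reserved characters.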