Latest JHU RefSeqv110 + Liftoff v5 gff3 does not pass gff3 validation

nrockweiler commented 1 year ago

Hi,

I believe there are a fair number of gff3 validation issues with the recent update of JHU RefSeqv110 + Liftoff v5 in commit dafcf67.

I've been using the GenomeTools gff3validator tool to find these issues. Below is a summary of the issues:

1115 records have an odd key named "The" in the attributes column, e.g.,

$ grep -n -m 1 "The=" chm13v2.0_RefSeq_Liftoff_v5.gff3
80912:chr1  Liftoff CDS 25137221    25137356    .   +   0   Parent=NM_001282867.1;db_xref=GeneID:6007;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=RHD;inference=similar to AA sequence (same species):RefSeq:NP_001269796.1;note=isoform 3 is encoded by transcript variant 3;The=RefSeq protein has 1 substitution compared to this genomic sequence;product=blood group Rh(D) polypeptide isoform 3;protein_id=NP_001269796.1;exon_number=4;extra_copy_number=0
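One plausible reason a validator objects to a key like `The` is that the GFF3 specification reserves attribute names beginning with an uppercase letter, and `The` is not among the predefined ones. The following is a minimal Python sketch of that check, not the GenomeTools implementation; the function name `unexpected_reserved_keys` is hypothetical.

```python
# Attribute tags predefined by the GFF3 specification.
RESERVED_KEYS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def unexpected_reserved_keys(attributes: str) -> list[str]:
    """Return keys that start with an uppercase letter but are not predefined.

    `attributes` is the raw text of column 9 of one GFF3 record.
    """
    flagged = []
    for pair in attributes.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, _ = pair.partition("=")
        if key[:1].isupper() and key not in RESERVED_KEYS:
            flagged.append(key)
    return flagged
```

Applied to the attributes of the record above, this sketch would report only `The`; lowercase keys such as `gbkey` or `exon_number` are left alone because applications may use them freely.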

MIR3690_1 is a PAR gene, and the same ID is used on both chrX and chrY. To follow the convention for other PAR genes, I think the copy on chrX should be renamed MIR3690.
```
$ grep -w "ID=MIR3690_1" chm13v2.0_RefSeq_Liftoff_v5.gff3 | cut -f 1
chrX
chrY
```
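To look for other IDs that are shared across sequences, the grep above can be generalized. This is a minimal Python sketch that assumes tab-separated, nine-column records; the function name `ids_on_multiple_seqids` is hypothetical.

```python
from collections import defaultdict


def ids_on_multiple_seqids(lines):
    """Map each ID that occurs on more than one seqid to its sorted seqids."""
    seqids_by_id = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature record
        for pair in cols[8].split(";"):
            key, sep, value = pair.partition("=")
            if key == "ID" and sep:
                seqids_by_id[value].add(cols[0])
    return {fid: sorted(s) for fid, s in seqids_by_id.items() if len(s) > 1}
```

Passing an open file handle for the GFF3 works, since the function only iterates over lines.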

There is more than 1 ID element on line 3999636 (the IDs are NM_001320962.1 and TSPY10P):

$ grep -n -w "ID=NM_001320962.1;ID=TSPY10P" chm13v2.0_RefSeq_Liftoff_v5.gff3
3999636:chrY    Liftoff transcript  9795914 9798710 .   +   .   ID=NM_001320962.1;ID=TSPY10P;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;Name=NM_001320962.1;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;matches_ref_protein=False;valid_ORF=False;inframe_stop_codon=True;extra_copy_number=0
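Records with a repeated key can be found without knowing the offending IDs in advance. This is a minimal Python sketch that counts keys in one attributes column; the function name `repeated_keys` is hypothetical.

```python
from collections import Counter


def repeated_keys(attributes: str) -> list[str]:
    """Return, in sorted order, the keys that occur more than once in column 9."""
    counts = Counter(
        pair.partition("=")[0]
        for pair in attributes.strip().split(";")
        if pair
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

For the record on line 3999636 this would return `["ID"]`.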

80 records have a malformed key-value pair in the attributes column; the "key" is called IDNM* and there is no value. I think this is supposed to be ID=NM*, e.g.:

$ grep -m 1 -P "\tIDNM" chm13v2.0_RefSeq_Liftoff_v5.gff3
chrY    Liftoff exon    9795914 9796445 .   +   .   IDNM_001320962.1-1;ID=NM_001320962.1;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;extra_copy_number=0
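A more general version of this check looks for any attribute entry that is not a well-formed `key=value` pair. This is a minimal Python sketch; the function name `malformed_pairs` is hypothetical.

```python
def malformed_pairs(attributes: str) -> list[str]:
    """Return entries in column 9 that lack '=', a key, or a value."""
    bad = []
    for pair in attributes.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            bad.append(pair)
    return bad
```

On the exon record above, only `IDNM_001320962.1-1` would be reported.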

While it didn't come up as a validation issue, I also saw text that looks percent-encoded where I expected plain ASCII characters, e.g., GeneID:100289087%2C (%2C is an encoded comma) and testis specific protein Y-linked 10%252C transcript variant (%252C looks like a comma that was encoded twice, since %25 is an encoded percent sign). Maybe this has something to do with the mention of correct[ing the] special character issues from the original file in the README?
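The encodings can be inspected with Python's standard `urllib.parse.unquote`. This is a minimal sketch; the helper names `looks_double_encoded` and `fully_decode` are hypothetical, and repeated decoding is only a diagnostic aid, since it would also alter a value that legitimately contains a literal `%2C`.

```python
import re
from urllib.parse import unquote

# "%25" is an encoded "%", so "%25" followed by two hex digits suggests
# that an already-encoded character was encoded a second time.
DOUBLE_ENCODED = re.compile(r"%25[0-9A-Fa-f]{2}")


def looks_double_encoded(value: str) -> bool:
    """Return True if the value contains a pattern such as %252C."""
    return bool(DOUBLE_ENCODED.search(value))


def fully_decode(value: str, max_rounds: int = 3) -> str:
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value
```

Note that in GFF3 an unescaped comma separates multiple values of one attribute, so the `%2C` between entries in `Dbxref` was probably meant to be a plain separator rather than a literal comma.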

Thank you! Nicole

diekhans commented 1 year ago

The UCSC browser GFF3 parser can't parse this either; it is invalid.

arangrhie commented 1 year ago

Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1.

@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and the GenomeTools gff3validator.

Let us know in case there are any other issues!

Best, Arang

nrockweiler commented 1 year ago

Wonderful! Thank you so much.

On Thu, Jul 6, 2023, 2:05 PM Arang Rhie wrote:

Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz


marbl / CHM13

Latest JHU RefSeqv110 + Liftoff v5 gff3 does not pass gff3 validation #82