maedat / GFF2MSS

GFF2MSS; GFF3 converter for DDBJ submission via MSS
MIT License

mismatch with Ginger gff output #21

Open myoshida0215 opened 1 week ago

myoshida0215 commented 1 week ago

Dear provider,

When I ran GFF2MSS on the GFF output of Ginger (v1.0.1, https://academic.oup.com/dnaresearch/article/30/4/dsad017/7227702), I got the following error.

Could you suggest what might be the cause?

Traceback (most recent call last):
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/GFF2MSS.py", line 558, in <module>
    gff_df_col = gff_df.attributes_to_columns()
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/gffpandas/gffpandas.py", line 132, in attributes_to_columns
    attribute_df['at_dic'] = attribute_df.attributes.apply(
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/pandas/core/series.py", line 4917, in apply
    return SeriesApply(
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/pandas/core/apply.py", line 1427, in apply
    return self.apply_standard()
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/pandas/core/apply.py", line 1507, in apply_standard
    mapped = obj._map_values(
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/pandas/core/base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/pandas/core/algorithms.py", line 1743, in map_array
    return lib.map_infer(values, mapper, convert=convert)
  File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
  File "/Users/yoshidamasaaki/Documents/Data/PAGS2023/2023.10.16/GFF2MSS-master/MSS/lib/python3.9/site-packages/gffpandas/gffpandas.py", line 133, in <lambda>
    lambda attributes: dict([key_value_pair.split('=') for
ValueError: dictionary update sequence element #1 has length 1; 2 is required

maedat commented 1 week ago

Thank you for sharing the details of the error with Ginger (v1.0.1). From the traceback, it seems that the issue is related to how the GFF attributes are being parsed, particularly where key-value pairs are expected but not found.
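
For reference, the same ValueError can be reproduced without gffpandas. This is only a minimal sketch: it assumes the parser splits the attributes column on ";" and then each piece on "=" (the traceback shows just the first half of that expression), and both attribute strings below are hypothetical.

```python
def parse_attributes(attributes: str) -> dict:
    """Assumed parsing scheme: split on ';', then split each piece on '='."""
    return dict([pair.split("=") for pair in attributes.split(";")])


# Hypothetical attribute strings, used only for illustration.
well_formed = "ID=mRNA_1;Parent=gene_1"
malformed = "ID=mRNA_1;Aargo017135"  # the second piece has no '='

print(parse_attributes(well_formed))

try:
    parse_attributes(malformed)
except ValueError as exc:
    # The piece without '=' becomes a one-element list, which dict() rejects.
    print(f"ValueError: {exc}")
```

Any piece that lacks "=" becomes a one-element list, and dict() cannot turn that into a key-value pair.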

Could you kindly share a portion of the GFF file generated by Ginger that triggered this error? This would help us better understand the cause and provide more specific guidance.

myoshida0215 commented 1 week ago

Yes, here are the files that trigger the error: test.1.gff.txt, test.1.fa.txt

maedat commented 1 week ago

The error seems to be caused by semicolons (;) inside the ID value, which the parser interprets as attribute delimiters (e.g., ID=mRNA_1;Aargo017135;). According to the GFF3 specification, semicolons are reserved for separating key=value pairs in the attributes column, so a literal semicolon inside a value has to be percent-encoded as %3B or avoided. To resolve this, the semicolons inside the ID value can be replaced with another symbol, such as a colon (:), to prevent misinterpretation.

It seems that semicolons (;) are improperly used not only within the ID field but also in other key-value pairs within the attributes field. Based on your example, keys like Note, gene, and potentially others also suffer from this issue. To address this comprehensively, the goal is to ensure that all semicolons within values (and not between key-value pairs) are replaced with a different delimiter, such as a colon (:).
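
A key-independent way to do this could look like the following. It is a rough sketch rather than part of GFF2MSS: the function names are hypothetical, and it assumes that any ";"-separated piece without "=" is a continuation of the previous value. A stray fragment that happens to contain "=" would still be treated as a new attribute.

```python
def repair_attributes(attributes: str, replacement: str = ":") -> str:
    """Merge pieces that lack '=' into the preceding key=value pair."""
    repaired = []
    for segment in attributes.split(";"):
        if not segment.strip():
            continue  # empty piece from a trailing or doubled semicolon
        if "=" in segment or not repaired:
            repaired.append(segment)
        else:
            # No '=' here: assume it belongs to the previous value.
            repaired[-1] += replacement + segment
    return ";".join(repaired)


def repair_gff_line(line: str) -> str:
    """Repair column 9 of a feature line; pass every other line through."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line  # not a feature line (e.g. sequence data)
    columns[8] = repair_attributes(columns[8])
    return "\t".join(columns) + newline


# Hypothetical attribute string modelled on the ID example above.
print(repair_attributes("ID=mRNA_1;Aargo017135;Parent=gene_1;Aargo017135;"))
```

Applying repair_gff_line to each line of the input and writing the results to a new file would cover every key at once, not only ID, Parent, Note, and gene.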

This one-liner may work with your GFF file:

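sed -E 's/(ID|Parent|Note|gene)=([^;]+);([^;=]+)(;|$)/\1=\2:\3\4/g' input.gff > output.gff

The third group excludes "=" so that a genuine following attribute (for example ;Parent=...) is left untouched.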
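
After rewriting the file, it may be worth checking that no malformed attributes remain before running GFF2MSS again. The script below is only a sketch with a hypothetical function name. It assumes a tab-separated GFF3 file with nine columns, and it reports every attribute piece that still lacks "=".

```python
import sys


def find_malformed(lines):
    """Return (line_number, piece) for each attribute piece without '='."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop scanning
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        for piece in columns[8].split(";"):
            if piece.strip() and "=" not in piece:
                problems.append((number, piece))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_gff.py output.gff
    with open(sys.argv[1]) as handle:
        for number, piece in find_malformed(handle):
            print(f"line {number}: piece without '=': {piece!r}")
```

If the script prints nothing, every attribute piece in the file has the key=value form.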