ozekik / lightrdf

A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3
Apache License 2.0

Unable to parse Starwars.ttl #15

Closed (supreme-core closed this issue 2 months ago)

supreme-core commented 2 months ago

Issue:

  1. Download Starwars Turtle file: Starwars.ttl

  2. Parse all the triples. Line 3569 of the file causes the parser to crash:

rdfs:label "ประเทศไนเจอร์"@th , "Niġer"@mt , "尼日尔"@zh-SG , "尼日尔"@zh-MY , "尼日尔"@zh-Hans , "尼日尔"@zh-CN , "尼日尔"@zh , "Niyer"@tl , "Ngāika"@mi , "Nigeru"@olo , "ނީޖަރު"@dv , "Nijèr"@gcr , "නයිජර්"@si , "Nnijer"@kab , "Nìger"@co , "Nìger"@pms , "Ниҷер"@tg , "Niiser"@ff , "Níher"@gn , "Niiger"@frr , "Nigerän"@vo , "Nícher"@an , "نایجېر"@ps , "尼日"@zh-classical , "နိုင်ဂျာနိုင်ငံ"@my , "Nìjẹ̀r"@yo , "尼日"@zh-TW , "尼日"@zh-Hant , "尼日"@lzh , "નાઈજર"@gu , "Niijer"@om , "നീഷർ"@ml , "Niseer"@wo , "ኒጄር"@am , "নাইজের"@bpy , "Nixèr"@sc , "नाइजर"@new , "नाइजर"@mai , "नाइजर"@hi , "नाइजर"@dty , "नाइजर"@bho , "नाइजर"@bh , "نیجر"@mzn , "نیجر"@lrc , "نیجر"@fa , "نیجر"@azb , "نيجر"@arz , "نيجر"@ary , "نائجر"@ur , "نائجر"@pnb , "Нигермудин Орн"@xal , "Republiek Niger"@nds , "Pow Nijer"@kw , "Niger"@hif , "Niger"@hak , "Niger"@gsw , "Niger"@gag , "Niger"@fy , "Niger"@fr , "Niger"@fo , "Niger"@fiu-vro , "Niger"@fi , "Niger"@eu , "Niger"@et , "Niger"@en-GB , "Niger"@en-CA , "Niger"@en , "Niger"@ee , "Niger"@dsb , "Niger"@de-CH , "Niger"@de-AT , "Niger"@de , "Niger"@da , "Niger"@cy , "Niger"@cs , "Niger"@crh-Latn , "Niger"@crh , "Niger"@ceb , "Niger"@cdo , "Niger"@bs , "Niger"@br , "Niger"@bjn , "Niger"@ban , "Niger"@az , "Niger"@als , "Niger"@ak , "Niger"@af , "Niger"@ace , "Niger"@bcl , "Niger"@hr , "Niger"@hsb , "Niger"@hu , "Niger"@ia , "Niger"@id , "Niger"@ie , "Niger"@ig , "Niĝero"@eo , "Niger"@ilo , "Niger"@it , "Niger"@jv , "Niger"@kaa , "Niger"@ki , "Niger"@nl , "Niger"@no , "Niger"@simple , "Niger"@ts , "Niger"@uz , "Niger"@vec , "Niger"@lb , "Niger"@vep , "Niger"@lg , "Niger"@li , "Niger"@lij , "Niger"@lmo , "Niger"@vi , "Niger"@vro , "Niger"@war , "Niger"@za , "Niger"@zh-min-nan , "Niger"@min , "Niger"@ms , "Niger"@nah , "Niger"@nan , "Niger"@nb , "Niger"@nds-NL , "Niger"@nn , "Niger"@nov , "Niger"@nso , "Niger"@pam , "Niger"@pap , "Niger"@pl , "Niger"@ro , "Niger"@scn , "Niger"@sco , "Niger"@se , "Niger"@sh , "Niger"@sk , "Niger"@sl , "Niger"@sm , "Niger"@sn , "Niger"@sr-EL , 
"Niger"@st , "Niger"@stq , "Niger"@su , "Niger"@sv , "Niger"@sw , "Niger"@szy , "Niger"@tk , "Niijir"@pih , "Nizëre"@sg , "Nijar"@ha , "नाईजर"@ne , "ניז'ר"@he , "নাইজার"@bn , "Niher"@zea , "Nigeri"@rw , "Nigeri"@sq , "ニジェール"@ja , "Nizer"@ln , "Nîjer"@ku , "Nigèr"@oc , "ନାଇଜର"@or , "Нігер"@uk , "Нігер"@be-x-old , "Нігер"@be-tarask , "Нігер"@be , "Níxer"@ast , "Níxer"@gl , "ನೈಜರ್"@kn , "Nicer"@diq , "INayijari"@ss , "Ніґер"@rue , "Նիգեր"@hy , "ናይጀር"@ti , "Nijier"@jam , "नीजे"@sa , "नीजे"@pi , "ܢܝܓܪ"@arc , "Nijer"@bm , "Nijer"@din , "Nijer"@io , "Nijer"@kg , "Nijer"@lad , "Nijer"@lfn , "Nijer"@tr , "Nìgeir"@gd , "Nijera"@mg , "Nigi"@ext , "Nigeris"@bat-smg , "Nigeris"@lt , "Nigeris"@sgs , "ניזשער"@yi , "ནི་ཇར།"@bo , "Nayjar"@so , "Yn Neegeyr"@gv , "Νίγηρας"@el , "Nigēra"@lv , "ნიგერი"@xmf , "ნიგერი"@ka , "நைஜர்"@ta , "Нигер"@udm , "Нигер"@tt , "Нигер"@sr-EC , "Нигер"@sr , "Nijè"@ht , "Нигер"@sah , "Нигер"@ru , "Нигер"@os , "Нигер"@mrj , "Нигер"@mn , "Нигер"@mk , "Нигер"@ky , "Нигер"@kk , "Нигер"@ce , "Нигер"@bxr , "Нигер"@bg , "Нигер"@ba , "Нигер"@ady , "Nizɛɛrɩ"@kbp , "Nig·èr"@frp , "INayighe"@zu , "An Nígir"@ga , "니제르"@ko , "نیجەر"@ckb , "Res publica Nigritana"@la , "నైజర్"@te , "नायजर"@mr , "النيجر"@ar , "النيجر"@aeb-Arab , "نائيجر"@sd , "نىگېر"@ug , "Niqir"@qu , "Ńiger"@szl , "မိူင်းၼၢႆးၵျႃး"@shn , "尼日爾"@zh-yue , "尼日爾"@zh-MO , "尼日爾"@zh-HK , "尼日爾"@yue , "尼日爾"@wuu , "ਨਾਈਜਰ"@pa , "Níger"@ca , "Níger"@cbk-zam , "Níger"@es , "Níger"@is , "Níger"@pt , "Níger"@pt-BR ;

Error:

lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 3569 at position 468

ozekik commented 2 months ago

Thank you for reporting!

It seems that the problem is not with lightrdf but with Starwars.ttl itself, which includes an invalid language tag (zh-classical, in "尼日"@zh-classical). The RDF specification requires language tags to be well-formed according to BCP 47, which limits each subtag to 8 characters; the subtag classical has 9.
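
To see whether other tags in the file would trip the same check, something like the following could be run first. This is only a minimal sketch using the standard library, not part of lightrdf: the names `LANGTAG`, `overlong_subtags`, and `find_bad_tags` are made up for illustration, the regex only looks at tags that directly follow a closing double quote, and it checks subtag length only, not full BCP 47 well-formedness.

```python
import re

# Assumption: language tags of interest directly follow a closing double quote,
# as in "尼日"@zh-classical. Single-quoted literals are not handled here.
LANGTAG = re.compile(r'(?<=")@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')

# BCP 47 allows at most 8 characters per subtag.
MAX_SUBTAG_LEN = 8


def overlong_subtags(tag):
    """Return the subtags of `tag` that exceed the length limit."""
    return [sub for sub in tag.split("-") if len(sub) > MAX_SUBTAG_LEN]


def find_bad_tags(lines):
    """Yield (line_number, tag, overlong_subtags) for each offending tag."""
    for number, line in enumerate(lines, start=1):
        for match in LANGTAG.finditer(line):
            tag = match.group(1)
            bad = overlong_subtags(tag)
            if bad:
                yield number, tag, bad


if __name__ == "__main__":
    sample = ['"尼日"@zh-classical , "Niger"@zh-min-nan , "Niger"@en']
    for number, tag, bad in find_bad_tags(sample):
        print(f"line {number}: tag {tag!r} has overlong subtag(s) {bad}")
```

For a real file, passing an open file object (opened with `encoding="utf-8"`) as `lines` should work the same way, since it iterates line by line.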

While I hope to implement better error handling in the future, for now I recommend replacing the problematic tags with valid ones before parsing as a workaround (maybe lzh instead of zh-classical?).
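
The workaround above could be sketched roughly like this. It is a plain-Python preprocessing step, not a lightrdf feature: `TAG_FIXES`, `fix_language_tags`, and `fix_file` are hypothetical names, and mapping zh-classical to lzh is only the suggestion from this comment, so please check that it suits your data. The regex assumes tags directly follow a closing double quote and does not try to handle escaped quotes inside literals.

```python
import re

# Hypothetical mapping from invalid tags to replacements (an assumption, adjust as needed).
TAG_FIXES = {"zh-classical": "lzh"}

# Matches a language tag that directly follows a closing double quote.
LANGTAG = re.compile(r'(?<=")@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')


def fix_language_tags(text, fixes=TAG_FIXES):
    """Replace whole language tags listed in `fixes`; leave all others untouched."""
    def swap(match):
        tag = match.group(1)
        return "@" + fixes.get(tag, tag)

    return LANGTAG.sub(swap, text)


def fix_file(src_path, dst_path, fixes=TAG_FIXES):
    """Write a copy of the Turtle file at `src_path` with the tags rewritten."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(fix_language_tags(line, fixes))


if __name__ == "__main__":
    print(fix_language_tags('"尼日"@zh-classical , "尼日"@zh-TW'))
```

The cleaned copy can then be parsed in place of the original file. In this particular line the label "尼日"@lzh already exists, so the replacement just produces a duplicate of an existing triple.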

supreme-core commented 2 months ago

Appreciate the feedback!