petitparser / dart-petitparser

Dynamic parser combinators in Dart.
https://pub.dartlang.org/packages/petitparser
MIT License

Parsing Latin1 characters #158

Closed tfgordon closed 10 months ago

tfgordon commented 1 year ago

I need to parse Latin1 characters which are not ASCII. My parser was working with version 4.0.2, but I now need to use a newer version of petitparser because of a dependency of the pdf Flutter package, which I also need.

Here's a simplified code snippet which fails:

// letter() extended with Latin 1 characters for coverage of most Western European languages
final Parser extChar = letter() | char('ä');

I've also tried using pattern, like this:

final Parser extChar = letter() | pattern("À-ÿ");

This also fails.

Would it be easiest to extend letter() to cover all Latin 1 alphabetic characters?

renggli commented 1 year ago

Note that PetitParser has never supported any encoding other than the standard UTF-16 code units of a Dart String. I recommend that you convert your input to a Dart String before parsing, for example using the built-in Latin1Codec.
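To illustrate the suggestion above, here is a minimal sketch of decoding Latin-1 bytes into a Dart String with `latin1` (the default `Latin1Codec` instance from `dart:convert`). The helper name `decodeLatin1` and the sample bytes are assumptions for illustration; the thread does not show how the input was actually read.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes Latin-1 (ISO-8859-1) bytes into a Dart String.
String decodeLatin1(List<int> bytes) => latin1.decode(bytes);

void main() {
  // Assumed sample input: the word "Käse" encoded as Latin-1 bytes.
  final bytes = <int>[0x4B, 0xE4, 0x73, 0x65];

  final text = decodeLatin1(bytes);
  print(text); // Käse

  // Each Latin-1 byte maps to one UTF-16 code unit of the same value.
  print(text.codeUnits); // [75, 228, 115, 101]
}
```

Once the input is an ordinary Dart String, a character such as 'ä' is the single code unit 228, which is what the parser sees.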

I am not aware of any change in how characters are read for a long time. Could you provide a short reproducible test case that passes with PetitParser 4.0.2, but fails with a newer version?

I agree that the built-in predicates such as letter() are simplistic. It would be great to have built-in support for Unicode character properties. Happy to discuss a possible implementation.
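As a rough idea of what a Unicode-aware letter test could look like, the sketch below uses Dart's `RegExp` with `unicode: true` and the `\p{L}` property escape. This is plain Dart, not PetitParser API; the name `isUnicodeLetter` is hypothetical, and how such a check would be wired into a parser is left open here.

```dart
// Matches exactly one code point whose Unicode general category is "Letter".
final RegExp letterPattern = RegExp(r'^\p{L}$', unicode: true);

/// Hypothetical helper: true if [ch] is a single Unicode letter.
bool isUnicodeLetter(String ch) => letterPattern.hasMatch(ch);

void main() {
  print(isUnicodeLetter('a')); // true
  print(isUnicodeLetter('ä')); // true
  print(isUnicodeLetter('÷')); // false (a math symbol, not a letter)
  print(isUnicodeLetter('7')); // false (a digit)
}
```

One reason a property-based check may be preferable to a plain range: the range À-ÿ (U+00C0 to U+00FF) also contains × (U+00D7) and ÷ (U+00F7), which are not letters.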

tfgordon commented 1 year ago

Thanks for your quick response. I now think the problem is not with PetitParser, but was caused by a change in the way I store files, which I made so that I can deploy the app as a web app. I am now using the Hive NoSQL database. Printing the output from the database before I try to parse it with PetitParser shows that (some?) non-ASCII characters are being corrupted. The characters returned are not ones handled by the grammar, so I get a parse error. I will see whether this problem can be fixed and hope that this solves the parsing problem as well.

tfgordon commented 1 year ago

I'd like to be able to help you with extending the letter() implementation, but I'm afraid that's over my head.

tfgordon commented 1 year ago

I found the problem. Hive encodes strings using UTF-8. I just needed to convert them into UTF-16 and everything works as it should.
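The thread does not show the actual fix, so the following is only a sketch of the kind of conversion described, assuming the stored value is available as raw UTF-8 bytes. The helper name `decodeStored` and the sample bytes are hypothetical. It also shows how the same bytes look when decoded with the wrong codec, which matches the corrupted output described earlier.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes UTF-8 bytes into a Dart (UTF-16) String.
String decodeStored(List<int> bytes) => utf8.decode(bytes);

void main() {
  // Assumed stored value: "Käse" encoded as UTF-8 (ä becomes 0xC3 0xA4).
  final stored = <int>[0x4B, 0xC3, 0xA4, 0x73, 0x65];

  // Decoding with the wrong codec yields one character per byte.
  print(latin1.decode(stored)); // KÃ¤se

  // Decoding as UTF-8 restores the intended string.
  print(decodeStored(stored)); // Käse
}
```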