problems with different file encodings

franzhollerer commented 6 years ago

We should provide a command line option which specifies the encoding used for the files.

Unfortunately, this does not solve the problem when different file encodings are used throughout the project.

A much better solution would be to detect the encoding on a per-file basis, similar to what the file command does.

franzhollerer commented 5 years ago

Update: We should definitely find a way to detect the encoding. This must work under both Linux and Windows. If the encoding is too unusual to detect reliably, we should stop with a meaningful error message and ask the user to convert the file to a more common encoding.
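A minimal sketch of what per-file detection with a hard failure could look like, using only the standard library so it behaves the same on Linux and Windows. The candidate list, the function name `detect_encoding`, and the exception `EncodingDetectionError` are assumptions made for illustration, not existing code in this project. Trying encodings in strict mode is a much cruder heuristic than what the file command does.

```python
import codecs

# Candidate encodings tried in order. This list is an assumption; note that
# an encoding such as latin-1 accepts every byte sequence and would therefore
# never trigger the error path.
CANDIDATES = ("utf-8", "cp1252")


class EncodingDetectionError(Exception):
    """Raised when none of the candidate encodings decodes the data."""


def detect_encoding(data: bytes, name: str = "<input>",
                    candidates=CANDIDATES) -> str:
    """Return the first candidate encoding that decodes `data` strictly."""
    # A UTF-8 byte order mark is an unambiguous hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise EncodingDetectionError(
        f"{name}: could not detect the encoding (tried "
        f"{', '.join(candidates)}); please convert the file to UTF-8"
    )
```

A caller would read each file in binary mode, pass the bytes to `detect_encoding`, and decode with the returned name, so every file in the project can use a different encoding.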

franzhollerer commented 5 years ago

Colin Cameron wrote: Interesting project, I hope that it gets some interest. I had a look around, the whole 'hodea' project is interesting.

All the scripts I have simply assume one particular encoding and validate against it. If a file doesn’t match, they generate an error or warning but continue to parse it as best they can, ignoring decoding errors.
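The "validate, warn, but carry on" behaviour described here could be sketched roughly as follows. The function name `decode_lenient` is hypothetical, and the use of `errors="replace"` is an assumption: it substitutes U+FFFD for undecodable bytes, whereas `errors="ignore"` would drop them silently.

```python
import warnings


def decode_lenient(data: bytes, encoding: str = "utf-8",
                   name: str = "<input>") -> str:
    """Decode `data`, warning instead of failing when it is not valid."""
    try:
        # Strict decoding doubles as the validation step.
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        warnings.warn(
            f"{name}: not valid {encoding} at byte {exc.start}; "
            "undecodable bytes were replaced"
        )
        # Second pass: keep going, replacing each bad byte sequence.
        return data.decode(encoding, errors="replace")
```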

There’s a character encoding detection Python library you could consider: chardet

I would suggest a command line argument to specify the encoding and make '--detect' one of the options for that. If --detect is specified, the code assumes that chardet is installed and uses that.
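One possible reading of this suggestion, as a sketch only: a single `--encoding` option whose special value `detect` hands the decision to chardet. The option name, the `detect` value, the UTF-8 default, and the helper names `build_parser` and `resolve_encoding` are all assumptions. The only chardet call used is `chardet.detect()`, which takes bytes and returns a dict whose `"encoding"` entry may be `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="encoding option sketch")
    parser.add_argument(
        "--encoding",
        default="utf-8",  # assumed default
        help="encoding of the source files, or 'detect' to guess per file "
             "(requires chardet)",
    )
    return parser


def resolve_encoding(option: str, data: bytes) -> str:
    """Map the command line value to a concrete encoding for one file."""
    if option != "detect":
        return option
    try:
        # Imported lazily so chardet stays an optional dependency.
        import chardet
    except ImportError as exc:
        raise SystemExit(
            "--encoding detect requires chardet (pip install chardet)"
        ) from exc
    guess = chardet.detect(data)["encoding"]
    if guess is None:
        raise SystemExit(
            "could not detect the encoding; pass --encoding explicitly"
        )
    return guess
```

Calling `resolve_encoding` once per file keeps detection on a per-file basis, while an explicit value such as `--encoding latin-1` applies to every file.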

I checked what Doxygen does (the closest similar tool I could think of) and it assumes UTF-8 unless the user says otherwise.

Cheers,

Col -

hodea / hodea-review-minder

problems with different file encodings #18