Closed: KaraKaraWitch closed this issue 1 month ago
Unfortunately charset-normalizer cannot work around corrupted elements. Even for a single character. Same as https://github.com/jawah/charset_normalizer/issues/354
regards,
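(For anyone trying to reproduce the detection side of this, a minimal sketch using charset-normalizer's public from_bytes API; the payload below is illustrative and the exact guess can vary between versions:)

from charset_normalizer import from_bytes

payload = "こんにちは world".encode("utf-8")
payload = payload[:3] + b"\x81" + payload[3:]  # inject one corrupted byte

best = from_bytes(payload).best()
print(best.encoding if best else None)  # the guess may no longer be utf_8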
Thanks for the reply! Understood that it's a limitation in charset-normalizer.
I added a workaround to my code that checks how many corrupted elements are ignorable. If the failure rate is above a certain %, it falls back to the guessed_encoding (from charset-normalizer).
For those wondering, here's a snippet of the code. It's a bit messy, but the comments should be enough:
import codecs


def get_errored_decodable_counts(encoding: str, data: bytes) -> int:
    """Counts the number of failed unicode characters from a given encoding.

    Args:
        encoding (str): The encoding to test
        data (bytes): The bytes data to try and decode from

    Returns:
        int: The number of U+FFFD replacement characters produced
    """
    return data.decode(encoding, errors="replace").count("\ufffd")
def is_codec_exists(encoding: str) -> bool:
    """Return True if the encoding name is known to Python's codec registry."""
    try:
        codecs.lookup(encoding)
        return True
    except LookupError:
        return False
orig_encoding = ""  # the encoding originally declared for the record, if any
record_content = b"Bytes content"
failurecounts = None
if orig_encoding and is_codec_exists(orig_encoding):
    # Fraction of bytes that decode to the U+FFFD replacement character.
    failurecounts = get_errored_decodable_counts(
        orig_encoding, record_content
    ) / len(record_content)
# ... Further down, the code looks like this.
# Use the original encoding if the original encoding looks better overall.
elif guessed_encoding != orig_encoding and not is_codec_exists(orig_encoding):
    # Guess we have to use the guessed encoding.
    filter_comments += (
        f'<[Original Encoding] "{orig_encoding}" does not exist. Using guessed.>'
    )
    correct_encoding = guessed_encoding
elif guessed_encoding != orig_encoding and failurecounts:
    filter_comments += f"<fc of [O] {orig_encoding} [G] {guessed_encoding} [%] {round(failurecounts * 100, ndigits=2)}%>"
    FC_PERCENT = 0.25
    if failurecounts * 100 < FC_PERCENT:
        filter_comments += f"<[UseOrig] [EDecode] fc usage < {FC_PERCENT}%>"
        correct_encoding = orig_encoding
    else:
        filter_comments += f"<[UseGuess] [EDecode] fc usage > {FC_PERCENT}%>"
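For anyone adapting this, a minimal usage sketch of the two helpers above (the sample bytes and the 0.25% threshold are illustrative, not the actual CommonCrawl record):

sample = b"mostly valid utf-8 text" + b"\x81" + b" with one corrupted byte"

if is_codec_exists("utf-8"):
    failure_rate = get_errored_decodable_counts("utf-8", sample) / len(sample)
    if failure_rate * 100 < 0.25:
        print("keep the declared encoding", failure_rate)
    else:
        print("fall back to the guessed encoding", failure_rate)  # 1/48 ~ 2.1%, so this branch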
Notice
I hereby announce that my raw input is not:
File
https://files.catbox.moe/h3bf02.html
Alternatively, it may be downloaded from archive.org since it's from CommonCrawl: https://web.archive.org/web/20240302062735im_/https://gaming.lenovo.com/emea/members/120723-Deminy?s=0e175b5c146036655dd127866a5a7999
Verbose output
Expected encoding
Expected UTF-8. It appears that the UTF-8 decode fails at one specific byte while the rest of the content seems to be UTF-8 compatible. In VS Code, it looks like this:
Using UTF-8, the content is as expected.
Using gb18030 in VS Code, it looks... odd.
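To illustrate the single-byte failure described above (a sketch with a made-up byte sequence, not the actual bytes from the linked file):

data = b"valid utf-8 text \xe4\xbd\xa0\xe5\xa5\xbd then one bad byte \x81 here"

try:
    data.decode("utf-8")  # strict decode fails on the lone 0x81
except UnicodeDecodeError as exc:
    print(exc.start, exc.reason)

print(data.decode("utf-8", errors="replace").count("\ufffd"))  # -> 1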
Additional context
I've noticed this specific failure condition a couple of times while processing CommonCrawl. One workaround is to only allow gb18030 when an initial encoding isn't detected. However, I might as well report this detection issue. :)
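A rough sketch of that workaround, assuming a declared_encoding taken from the HTTP header or <meta> tag (the function and variable names here are illustrative; from_bytes and its cp_exclusion argument are charset-normalizer's public API, but double-check against your installed version):

from charset_normalizer import from_bytes

def pick_encoding(record_content: bytes, declared_encoding: str | None) -> str | None:
    if declared_encoding:
        # Something was declared upstream: keep gb18030 out of the running.
        result = from_bytes(record_content, cp_exclusion=["gb18030"]).best()
    else:
        # Nothing declared: allow the full candidate list, gb18030 included.
        result = from_bytes(record_content).best()
    return result.encoding if result else declared_encoding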