samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

handle utf-8 encoding #1569

Open yash-puligundla opened 3 years ago

yash-puligundla commented 3 years ago

Description

If a SAM file contains UTF-8 encoded characters, those characters are corrupted when the file is written. This is because AsciiWriter downcasts each input char to a byte.

More details here. The SAM specification lists the fields in which UTF-8 encoding is allowed.
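
The corruption described above can be reproduced without htsjdk. The following is a minimal sketch, not the AsciiWriter source: it assumes only that each char is narrowed to a single byte, as stated in the description. The class name `DowncastDemo` and the method `downcast` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DowncastDemo {
    // Hypothetical stand-in for the behavior described in this issue:
    // each char is narrowed to one byte, discarding its upper 8 bits.
    static byte[] downcast(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String value = "\u4e2d"; // one CJK character (U+4E2D)
        byte[] narrowed = downcast(value);
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(narrowed)); // [45], i.e. the ASCII '-'
        System.out.println(Arrays.toString(utf8));     // [-28, -72, -83], the real UTF-8 bytes
    }
}
```

The narrowed output is a single byte that decodes to an unrelated ASCII character, so the original character cannot be recovered from the written file.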

Fix

Convert the String to bytes with str.getBytes(StandardCharsets.UTF_8) instead of casting each char to a byte.
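
A short sketch of what that conversion does, using only the standard `String.getBytes(Charset)` call named above. The class name `Utf8EncodeDemo`, the method `encode`, and the sample value are hypothetical and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

public class Utf8EncodeDemo {
    // Encode the whole string at once so multi-byte characters are preserved.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // hypothetical field value with one non-ASCII character
        byte[] bytes = encode(value);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(bytes.length);          // 5: three ASCII bytes plus two bytes for U+00E9
        System.out.println(decoded.equals(value)); // true: the value round-trips intact
    }
}
```

For pure ASCII input the resulting bytes are identical to what a char-to-byte cast would produce, so existing ASCII-only files are unaffected by the change in encoding.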

yash-puligundla commented 3 years ago

Pushed more commits to re-trigger the Travis build, but that didn't work.

lbergelson commented 3 years ago

So looking at this more closely, it's very unclear to me why we have this class in the first place. It seems to be entirely some sort of performance optimization that's designed to avoid the cost of converting string formats. Is it possible that we could just remove this and replace it with a normal java Writer set to use UTF-8?
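
For reference, a "normal Java Writer set to use UTF-8" could look roughly like the sketch below. It uses only standard `java.io` classes; the class name `Utf8WriterDemo`, the helper names, and the sample line are hypothetical, and nothing here reflects how htsjdk actually wires up its writers.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterDemo {
    // Wrap any OutputStream in a standard Writer that encodes text as UTF-8.
    static Writer utf8Writer(OutputStream out) {
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Write one line to an in-memory sink and return the encoded bytes.
    static byte[] write(String line) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = utf8Writer(sink)) {
            w.write(line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String line = "@CO\tcaf\u00e9\n"; // hypothetical header comment line
        byte[] bytes = write(line);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(line));
    }
}
```

Whether this matches the performance of the existing class is an open question that the sketch does not address.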

yash-puligundla commented 3 years ago

@lbergelson Just to clarify, do you mean the class would be replaced with a normal Java Writer set to use UTF-8 for the fields that permit it, and ASCII for the remaining fields that do not permit UTF-8?