Closed nathandunn closed 4 years ago
@nathandunn What is the VCF version of the file being rendered? The 4.3 spec explicitly states that the encoding is UTF-8, but I think that was a change in VCF 4.3. Since htsjdk doesn't implement writing 4.3 (it always writes v4.2), it writes VCF files using ISO-8859-1 encoding, though looking at the 4.2 spec I don't see that specified anywhere. I think there may have been ambiguities pre-v4.3.
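To illustrate why the writer's and reader's charsets need to agree, here is a minimal sketch using only `java.nio.charset` (no htsjdk calls). The class name `EncodingMismatch`, the helper `roundTrip`, and the sample text are hypothetical and are not taken from the issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    // Hypothetical helper: encode text with one charset and decode the bytes
    // with another, as happens when a writer and a reader disagree.
    public static String roundTrip(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "café", one non-ASCII character

        // The same character occupies a different number of bytes.
        System.out.println(value.getBytes(StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(value.getBytes(StandardCharsets.UTF_8).length);      // 5

        // UTF-8 bytes decoded as ISO-8859-1: each byte becomes its own character.
        String misread = roundTrip(value, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("caf\u00c3\u00a9")); // true

        // ISO-8859-1 bytes decoded as UTF-8: the lone 0xE9 byte is malformed,
        // so the original text does not survive.
        String lossy = roundTrip(value, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(lossy.equals(value)); // false
    }
}
```

Plain ASCII content is identical in both encodings, so a mismatch like this would only show up on fields that contain non-ASCII characters.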
@cmnbroad Thanks. This is very helpful.
It looks like we are using VCF 4.2, most likely with UTF-8 content.
Are there any short-term plans to move to 4.3 or UTF-8? No worries if not, just trying to plan for this case.
There is currently read support for 4.3 in htsjdk, but not write. There is a desire to get v4.3 writing implemented, especially since there is a v4.4 spec being developed, but I don't think anyone is lined up to do it at the moment.
I'm closing this as resolved, but @nathandunn feel free to re-open if you still think there is something that needs fixing.
Thanks @cmnbroad. We will look at it for our next sprint.
I take it there is no way to manually set the 4.2 encoding to UTF-8 prior to reading (we are only reading)?
I did not see any.
I'm not aware of any way to do that, but for 4.2, I would try ISO-8859-1 for your example (when you do the String conversions).
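One way to read the suggestion above, as a sketch only: if a 4.2 file actually contains UTF-8 bytes but its text reaches you decoded as ISO-8859-1, the original bytes can be recovered and decoded again as UTF-8. That the reader hands back ISO-8859-1-decoded strings is an assumption here, not something confirmed in this thread. The class `Vcf42Redecode` and helper `latin1ToUtf8` are hypothetical names and are not part of htsjdk.

```java
import java.nio.charset.StandardCharsets;

public class Vcf42Redecode {
    // Hypothetical helper (not an htsjdk API): recover the raw bytes of a string
    // that was decoded as ISO-8859-1, then decode those bytes as UTF-8.
    // ISO-8859-1 maps every byte to exactly one character, so no bytes are lost.
    public static String latin1ToUtf8(String decodedAsLatin1) {
        byte[] raw = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a UTF-8 field value that was read through an ISO-8859-1 decoder.
        byte[] onDisk = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String misread = new String(onDisk, StandardCharsets.ISO_8859_1);

        System.out.println(latin1ToUtf8(misread).equals("caf\u00e9")); // true
    }
}
```

This only helps when the underlying bytes really are UTF-8; applying it to text that was genuinely ISO-8859-1 would corrupt any non-ASCII characters.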
Will do.
Thanks,
Nathan
@cmnbroad If we fork your codebase and change this line: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFEncoder.java#L30
will that force the encoding over and fix it?
@nathandunn That value is only used for encoding/writing.
Before you submit
I have checked. I could not find it.
Description of the issue:
https://github.com/GMOD/Apollo/issues/2498
Your environment:
Steps to reproduce
Given a VCF entry:
Using a file reader and index (using groovy here, but it shouldn't matter):
// ...
println "default encoding ${Charset.defaultCharset()} -> ${System.getProperty('file.encoding')}"
Expected behavior
I would expect to get:
Actual behavior
I get: