samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

some UTF-8 characters aren't being rendered properly on parse #1485

Closed nathandunn closed 4 years ago

nathandunn commented 4 years ago

Before you submit

I have checked. I could not find it.

Description of the issue:

https://github.com/GMOD/Apollo/issues/2498

Your environment:

Steps to reproduce

Given a VCF entry:

2R      23979089        NT_033778.4:g.23979090_23980927del      TCCAGCCGTCAATTTCGATTCCACTTCCAATCCAAACCCAACTCGCATTCGTATCTCATCGCTTCGGTTTCGTTGCGCTGCGCCGAGAGTTTCGTTTTCAGTGCTTCTTCAGTTCACAGTTCAGTTCAGTTCAGTTCGGTGTGGATTCAGTTGGATTCGAATCGATGTTTAGTGTAGATTGGCCAGGACACGGAGAACAAATAAGTCCCTCATCGCGCGGCGTGCAAAACCATCCGAACTAACTAAGTCTAGCCAAGTCTAGTAGCTAGAAACTAGAAACAAGAAAGCTACATACATATGTGCGTAAACCGTGGATGCCACAGAGCTGAGATACATTTAACCAAGCGATCTCGCTTATTGTCGAGTGATAACACTAAGATTGGGCAGAAACATTCGGAACCAGCTGGAGTTCAAACATTATCAAAACATTTGTGTTTCAAAATCAACACACTTAAAGCAAACAAACACAGTCGAAAGCGCGAGTGCTGCAGTATTAGTGGTGACCACACACAGATACACACATGCACACGACCATATCGTTACATACGTGTGTAACTCTCTGCACATAAATGCGAGCCAATCTATTTAATTCAGTTTCAGTGCATTTATAAATTAATAAAATTTACCAAGTGCGAGAGTGAGGGAGCGACACGCACACCTATGCACGCTCATTACACGGACGAGGCGAACGTGAGTCTTGGAATATTGCAAACAACACCAACAACAACAAGAACAACAACAAAACCAAAAGCGAAACAGCAAAAAATAAATAAATAACGAGGAACCAGTTTCACCTTGAGGAGCAATACCTAGTGCACACTCACACTCGCACTCGCATTCACACAAATGAAACAGCCCGATCTTACTCTTACTGCGAGTACGGACACATAGTGCACATATAGTGCATATAGTGCACAGCACAGAGCACAGAGTAGACATAGTGACCACCACATAATTTCGTGATAAAGCCACAGAGAATCGGAGCGCTCCGCCTTATCGGCAACCCACTGCCACTGGTCCGGCTACTATGCTCCAGCGGGGATCGGGACATCATCGCTGGGATAGAGACACAGTGGACACCAGAACTGGGATGCAGTTGCAGCGGCCCCAAACGCATTGAAAGATGATAGCTAAGCCCAACCAGGCCACCACCGAACCACCATTAAGCTTGCGCCCCGGAACAGTGCCAACGGTTCCAGCAACCACCCCAGCCAGACCAGCGACCATCACCATCCAGCGAAGGCATCCAGCCCCGAAAGCGGATTCCACACCCCACACTTTGCCACCGTTCTCGCCTTCGCCTTCGCCAGCGTCGTCGCCTTCGCCAGCGCCAGCGCAAACGCCTGGAGCACAAAAAACACAAAGCCAGGCAGCTATTACTCATCCAGCGGCTGTGGCTTCGCCTTCCGCGCCTGTTGCTGCAGCTGCACCGAAGACCCCCAAGACCCCGGAACCCCGGAGTACCCACACCCACACTCACACTCACAGCCAGCACTTCAGCCCCCCTCCACGCGAATCCGAAATGGACGGCGAAAGATCTCCGAGCCACAGCGGCCATGAAATGACACTGAGCATGGACGGCATCGATTCCAGCCTGGTGTTCGGATCTGCACGGGTTCCTGTCAACTCCAGCACCCCGTACTCGGATGCGACTCGAGTGAGTAACACTGTCTACACTGAGGGAAATTGGGATCAAGTAGGAGGTAGGACTGTATAACCCTTATTATATTTCGGTCTTCGACACCATTTTCCCTAAGGTACATATTTCTGTCCAGGCTGGCAGGAATGTGTTGTTGGTTGATTTGGTCATCTATCCATCCGATTAGAAGATCCGCTCTA T       .       .       
hgvs_nomenclature="NT_033778.4:g.23979090_23980927del";geneLevelConsequence="splice_donor_variant|coding_sequence_variant|5_prime_UTR_variant|intron_variant";transcriptLevelConsequence="splice_donor_variant|coding_sequence_variant|5_prime_UTR_variant|intron_variant,splice_donor_variant|coding_sequence_variant|5_prime_UTR_variant|intron_variant,splice_donor_variant|coding_sequence_variant|5_prime_UTR_variant|intron_variant";geneImpact="HIGH";transcriptImpact="HIGH,HIGH,HIGH";allele_symbols="Sox14<sup>Δ15</sup>";allele_symbols_text="Sox14<Δ15>";soTerm="deletion";allele_of_gene_ids="FB:FBgn0005612";allele_of_gene_symbols="Sox14";allele_of_transcript_ids="FB:FBtr0343282,FB:FBtr0072157,FB:FBtr0072158";allele_of_transcript_gff3_ids="FB:FBtr0343282,FB:FBtr0072157,FB:FBtr0072158";allele_of_transcript_gff3_names="Sox14-RC,Sox14-RA,Sox14-RB"

Using a file reader and index (Groovy here, but it shouldn't matter):

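        import htsjdk.variant.variantcontext.VariantContext
        import htsjdk.variant.vcf.VCFFileReader
        import java.nio.charset.Charset

        VCFFileReader vcfFileReader = new VCFFileReader(file)
        // query() returns a CloseableIterator, so collect the results into a list
        List<VariantContext> queryResults = vcfFileReader.query(sequenceName, (int) start + 1, (int) end).toList()
        VariantContext variantContext = queryResults.get(7) // some value

        println "default encoding ${Charset.defaultCharset()} -> ${System.getProperty('file.encoding')}"

        // read the INFO attribute that contains the non-ASCII character
        def variantAttributes = variantContext.getCommonInfo().getAttributes()
        println "attribute: ${variantAttributes.get('allele_symbols')}"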

Expected behavior

I would expect to get:

                default encoding UTF-8 -> UTF-8
                attribute: "Sox14<sup>∆15</sup>"

Actual behavior

I get:

   default encoding UTF-8 -> UTF-8
   attribute: "Sox14<sup>Δ15</sup>"

cmnbroad commented 4 years ago

@nathandunn What is the VCF version of the file being rendered? The 4.3 spec explicitly states that the encoding is UTF-8, but I think that was a change in VCF 4.3. Since htsjdk doesn't implement writing 4.3 (it always writes v4.2), it writes VCF files using ISO-8859-1 encoding, though looking at the 4.2 spec I don't see that specified anywhere. I think there may have been ambiguities pre-v4.3.
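To illustrate the kind of mismatch described here, the following is a minimal sketch using only the JDK (no htsjdk calls). It assumes the file really is UTF-8 and that the bytes end up being decoded as ISO-8859-1; the class and method names are hypothetical. It shows that the two UTF-8 bytes of the Greek capital delta become two separate Latin-1 characters.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Decode the UTF-8 bytes of a string as if they were ISO-8859-1,
    // which is what happens when UTF-8 input is read with a Latin-1 decoder.
    static String decodeUtf8AsLatin1(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The escape below is GREEK CAPITAL LETTER DELTA (U+0394)
        String original = "Sox14<sup>\u039415</sup>";
        String garbled = decodeUtf8AsLatin1(original);

        // Delta takes two bytes in UTF-8, so the Latin-1 view of the
        // same bytes is one character longer than the original.
        System.out.println(original.length() + " chars -> " + garbled.length() + " chars");

        // Print code points instead of glyphs, because one of the
        // resulting characters is a non-printing control character.
        garbled.chars()
               .filter(c -> c > 0x7F)
               .forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

How the second character is displayed depends on the terminal or viewer, which may explain why the garbled text looks different in different places.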

nathandunn commented 4 years ago

@cmnbroad Thanks. This is very helpful.

It looks like our files use VCF 4.2 and are most likely encoded as UTF-8.

Are there any short-term plans to move to 4.3 or UTF-8? No worries if not, just trying to plan for this case.

cmnbroad commented 4 years ago

There is currently read support for 4.3 in htsjdk, but not write. There is a desire to get v4.3 writing implemented, especially since there is a v4.4 spec being developed, but I don't think anyone is lined up to do it at the moment.

cmnbroad commented 4 years ago

I'm closing this as resolved, but @nathandunn feel free to re-open if you still think there is something that needs fixing.

nathandunn commented 4 years ago

Thanks @cmnbroad. We will look at it for our next sprint.

I take it there is no way to manually set the 4.2 encoding to UTF-8 prior to reading (we are only reading)?

I did not see any.

cmnbroad commented 4 years ago

I'm not aware of any way to do that, but for 4.2, I would try ISO-8859-1 for your example (when you do the String conversions).
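A minimal sketch of what such a String conversion might look like, using only the JDK. It assumes the attribute value was produced by decoding UTF-8 bytes as ISO-8859-1, which this thread suggests but does not confirm for the read path; `reinterpretAsUtf8` is a hypothetical helper, not an htsjdk method.

```java
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {
    // Hypothetical helper (not part of htsjdk): recover the raw bytes by
    // re-encoding as ISO-8859-1, then decode those bytes as UTF-8.
    static String reinterpretAsUtf8(String decodedAsLatin1) {
        byte[] rawBytes = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String expected = "Sox14<sup>\u039415</sup>"; // contains U+0394, Greek capital delta

        // Simulate an attribute value read from UTF-8 input with a Latin-1 decoder
        String asRead = new String(expected.getBytes(StandardCharsets.UTF_8),
                                   StandardCharsets.ISO_8859_1);

        String repaired = reinterpretAsUtf8(asRead);
        System.out.println("repaired equals expected: " + repaired.equals(expected));
    }
}
```

In the Groovy snippet from the report, this would be applied to the value returned by `variantAttributes.get('allele_symbols')`. The round trip is lossless only when every character in the string is at or below U+00FF, because ISO-8859-1 maps each byte to exactly one such character; a string that was already decoded correctly and contains other characters would be damaged by it.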

nathandunn commented 4 years ago

Will do.

Thanks,

Nathan

On Jun 30, 2020, at 3:48 PM, Chris Norman notifications@github.com wrote:

I'm not aware of any way to do that, but for 4.2, I would try ISO-8859-1 for your example (when you do the String conversions).

nathandunn commented 4 years ago

@cmnbroad if we fork your codebase and change this line: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFEncoder.java#L30

will that fix it by forcing the encoding to UTF-8?

cmnbroad commented 4 years ago

@nathandunn That value is only used for encoding/writing.