hammerlab / hlarp

Normalize HLA typing output.
Apache License 2.0

Error when reading seq2HLA output containing bad characters #25

Closed jburos closed 7 years ago

jburos commented 7 years ago

I am getting the following error when reading seq2HLA output:

$ hlarp seq2HLA /path/to/*-concatseq2hla-workdir/
hlarp: internal error, uncaught exception:
       Scanf.Scan_failure("scanf: bad input at char number 30: float_of_string")

Fatal error: exception Failure("cmdliner error")

When I look at the seq2HLA output, I see a bad character (') in the *-ClassI.HLAgenotype4digits file. This may or may not be the cause of the error, but it does happen to be the 31st character in this file.

$ cat /path/to/*-concatseq2hla-workdir/rna-patient_id_T-ClassI.HLAgenotype4digits
#Locus  Allele 1    Confidence  Allele 2    Confidence
A   A*29:02 7.023651e-07    A*02:01'    0.001507468
B   B*44:02'    0.0 B*44:02 0.0483829
C   C*05:01 0.03815254  C*16:01 0.06750633

This character appears in a number of our seq2HLA outputs from epidisco. I'm not sure whether this is an epidisco issue or now-standard seq2HLA output that hlarp should accommodate.

Let me know if I can provide more info.

rleonid commented 7 years ago

I understand the source of the problem. Could you please email me the offending seq2HLA file? If I remember correctly, that tick mark has some special significance in seq2HLA, which I'll dig into.

jburos commented 7 years ago

Thanks @rleonid! I will email you the output from seq2HLA for this sample. Agreed, I do think there is some significance to these.

rleonid commented 7 years ago

I think the latest version parses this correctly. Which version did you end up using?

$ ./hlarp seq2HLA test/mnt/nfs-pool-rcc/biokepi/results/bms/RCC_v01_with-rna_Sample_11_8-normal-CA209009_11_8_N-tumor-CA209009_11_8_T_SCR-rna-CA209009_11_8_T_C2D8-b37/48c05b974b6d025ad8ef06bb9470bef9rna-CA209009_11_8_T_C2D8edsl-concatseq2hla-workdir/
class,allele,qualifier,confidence,run
1,A*02:01',,0.001507,rna-CA209009_11_8_T_C2D8
1,A*29:02,,0.000001,rna-CA209009_11_8_T_C2D8
1,B*44:02,,0.048383,rna-CA209009_11_8_T_C2D8
1,B*44:02',,0.000000,rna-CA209009_11_8_T_C2D8
1,C*05:01,,0.038153,rna-CA209009_11_8_T_C2D8
1,C*16:01,,0.067506,rna-CA209009_11_8_T_C2D8
2,DQA1*02:01,,,rna-CA209009_11_8_T_C2D8
2,DQA1*02:01,,0.000000,rna-CA209009_11_8_T_C2D8
2,DQB1*02:02',,0.000000,rna-CA209009_11_8_T_C2D8
2,DQB1*03:02',,0.000407,rna-CA209009_11_8_T_C2D8
2,DRB1*07:01,,0.000000,rna-CA209009_11_8_T_C2D8
2,DRB1*14:103,,0.279392,rna-CA209009_11_8_T_C2D8

It is true, though, that this apostrophe is an ambiguity indicator in seq2HLA output (see https://bitbucket.org/hammerlab/seq2hla/src/edc2b613de435e88bf6fc688324613ad50ee7453/seq2HLA.py?at=default&fileviewer=file-view-default#seq2HLA.py-227).

Looking through my notes, I can't remember the exact (statistical) meaning of ambiguous in this case.
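
For anyone who needs to read these files outside hlarp, here is a minimal Python sketch of one way to tolerate the apostrophe. It is not hlarp's implementation (hlarp is OCaml and, as the output above shows, keeps the apostrophe in the allele column). The names `AlleleCall`, `parse_allele`, and `parse_genotype_lines` are hypothetical, and the sketch assumes each data row has exactly five whitespace-separated fields, as in the ClassI file quoted in this issue. Treating an unparseable confidence as missing is also an assumption.

```python
from typing import Iterable, List, NamedTuple, Optional


class AlleleCall(NamedTuple):
    locus: str
    allele: str                  # allele name without the trailing apostrophe
    ambiguous: bool              # True if seq2HLA appended an apostrophe
    confidence: Optional[float]  # None if the field is not a number


def parse_allele(locus: str, raw_allele: str, raw_confidence: str) -> AlleleCall:
    """Split one allele field into its name and its ambiguity marker."""
    ambiguous = raw_allele.endswith("'")
    allele = raw_allele.rstrip("'")
    try:
        confidence = float(raw_confidence)
    except ValueError:
        # Assumption: a non-numeric confidence is recorded as missing.
        confidence = None
    return AlleleCall(locus, allele, ambiguous, confidence)


def parse_genotype_lines(lines: Iterable[str]) -> List[AlleleCall]:
    """Parse the rows of a *.HLAgenotype4digits file into allele calls."""
    calls = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the "#Locus ..." header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
        locus, allele1, conf1, allele2, conf2 = fields
        calls.append(parse_allele(locus, allele1, conf1))
        calls.append(parse_allele(locus, allele2, conf2))
    return calls


if __name__ == "__main__":
    sample = [
        "#Locus  Allele 1    Confidence  Allele 2    Confidence",
        "A   A*29:02 7.023651e-07    A*02:01'    0.001507468",
    ]
    for call in parse_genotype_lines(sample):
        print(call)
```

Keeping the marker as a separate boolean, rather than dropping it, preserves whatever seq2HLA meant by it until its exact meaning is pinned down.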