lgragert / nn-sero-pytorch

PyTorch version of neural network HLA serology prediction

Imputation of missing HLA sequences #33

gbiagini closed this issue 4 years ago

gbiagini commented 4 years ago

Impute HLA sequences missing from the msf files using Hamming distance.

Considerations:

  1. How can we tell which part of a sequence is "missing" and which part simply does not exist? For example, how should alleles that are truncated at the start or end be handled?
  2. What threshold should be used for sequence inference? This could probably be answered through a literature review.
    • I'm assuming the inference will be done using some sort of moving window of residues for the calculations, but how many residues wide should the window be? (Trial and error?)
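
For discussion, here is a minimal sketch of nearest-neighbour imputation by Hamming distance. It is not the code in this repository. It assumes the sequences are already aligned to equal length and that unknown residues are marked with `*`; the `MISSING` marker and the function names `hamming` and `impute` are hypothetical. It compares whole sequences rather than a moving window, and it does not address the truncation question in item 1.

```python
MISSING = "*"  # assumed placeholder for an unknown residue


def hamming(a, b):
    """Count mismatches over positions that are known in both aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if MISSING not in (x, y) and x != y)


def impute(target, references):
    """Fill unknown positions in `target` from its closest reference sequence.

    `references` maps allele name -> aligned sequence.
    Returns the filled sequence and the name of the donor allele.
    Ties are resolved in favour of the first reference in iteration order.
    """
    donor = min(references, key=lambda name: hamming(target, references[name]))
    donor_seq = references[donor]
    # Keep known residues; copy the donor residue only where the target is unknown.
    filled = "".join(d if t == MISSING else t for t, d in zip(target, donor_seq))
    return filled, donor


# Toy data, not real HLA sequences.
references = {"A*01:01": "MAVMAPRT", "A*02:01": "MAVMAPQT"}
filled, donor = impute("**VMAPQ*", references)
```

A windowed variant would restrict the `hamming` call to the residues around each missing stretch; the window width would still need to be chosen empirically.
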
gbiagini commented 4 years ago

The imputation algorithm was implemented with the merge of the feature/imputation branch. The imputation can still be enhanced, but the algorithm seems accurate in a general comparison with the old RSNNS pat files. Currently, fewer than 3 alleles per locus show any sequence differences between the two systems. Those "different" alleles will be investigated.
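
A comparison like the one described could be sketched as follows. This assumes the sequences from both systems have already been loaded into dictionaries mapping allele name to sequence (parsing of the pat files is not shown), and that the locus is the part of the allele name before `*`. The function name `differing_alleles` is hypothetical.

```python
from collections import defaultdict


def differing_alleles(new, old):
    """Group alleles whose sequences differ between two systems by locus.

    `new` and `old` map allele name -> sequence. Only alleles present in
    both mappings are compared.
    """
    diffs = defaultdict(list)
    for allele in sorted(new.keys() & old.keys()):
        if new[allele] != old[allele]:
            locus = allele.split("*", 1)[0]  # assumed naming: "A*02:01" -> "A"
            diffs[locus].append(allele)
    return dict(diffs)


# Toy data, not real HLA sequences.
new = {"A*01:01": "MAV", "A*02:01": "MAQ", "B*07:02": "MLV"}
old = {"A*01:01": "MAV", "A*02:01": "MAR", "B*07:02": "MLV"}
result = differing_alleles(new, old)
```
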