nmdp-bioinformatics / py-ard

HLA ARD Reduction in Python
https://py-ard.org/
GNU Lesser General Public License v3.0
17 stars 13 forks source link

g-group reduction for three-field alleles that aren't in a G-group is done incorrectly #321

Closed lgragert closed 6 months ago

lgragert commented 7 months ago
>>> allele = ard.redux("DRB4*01:03:01", 'lgx')
>>> allele
'DRB4*01:01'
>>> allele = ard.redux("DRB4*01:03:13", 'lgx')
>>> allele
'DRB4*01:03'
>>> allele = ard.redux("DRB4*01:03:19", 'lgx')
>>> allele
'DRB4*01:03'

This allele is included in DRB4*01:01P but is not currently in any G-group.

https://www.ebi.ac.uk/ipd/imgt/hla/alleles/allele/?accession=HLA28544

The correct result of this redux should be DRB4*01:01.

lgragert commented 7 months ago

Our workaround is to first do a two-field redux, then do the ARD redux.

pbashyal-nmdp commented 7 months ago

I see that it's correctly finding the P group.

❯ pyard -g 'DRB4*01:03:19' -r "P"
DRB4*01:01P
pbashyal-nmdp commented 7 months ago

Or do P first and then lgx

>>> ard.redux(ard.redux('DRB4*01:03:19', 'P'), 'lgx')
'DRB4*01:01'
lgragert commented 7 months ago

I no longer agree with the suggested workaround P first and then lgx.

For the allele C*02:10:02, lgx gives the correct behavior but the workaround is incorrect.

>>> ard.redux("C*02:10:02", 'lgx')
'C*02:10'
>>> ard.redux(ard.redux('C*02:10:02', 'P'), 'lgx')
'C*02:02'
pbashyal-nmdp commented 7 months ago

We do have a ping mode. i.e. P in G I think this maybe what you want.

>>> my_config = {"ping": True}
>>> ard = pyard.init(config=my_config)
>>> ard.redux("DRB4*01:03:01", 'lgx')
'DRB4*01:01'
lgragert commented 7 months ago

Bummer, I tried ping mode and can't get it to work as expected either.

Any tips?

>>> my_config = {"ping": True}
>>> ard = pyard.init(config=my_config)
Creating /var/folders/_g/yg03ywj55jz58thbzgcwn21h0000gp/T/pyard-lgragert/pyard-Latest.sqlite3 as cache.
Version: 3560
>>> ard.redux("DRB4*01:03:19", 'lgx')
'DRB4*01:03'
>>> ard.redux("DRB4*01:03:19", 'P')
'DRB4*01:01P'
pbashyal-nmdp commented 7 months ago

Interesting. It works for DRB4*01:03:01 but not DRB4*01:03:19

>>> import pyard
>>> my_config = {"ping": True}
>>> ard = pyard.init(config=my_config)
>>> ard.redux("DRB4*01:03:19", 'lgx')
'DRB4*01:03'
>>> ard.redux("DRB4*01:03:01", 'lgx')
'DRB4*01:01'

Guess today is DRB4 breaking day :) I'll take a look.

lgragert commented 7 months ago

See above, I revised a response and now disagree with your suggested workaround.

pbashyal-nmdp commented 7 months ago

The solution is to use the ping mode, but I'll have to investigate why it's not working for others besides DRB4*01:03:01

lgragert commented 7 months ago

No no no... ping mode doesn't work either, as shown above. The only valid workaround we found is to first do a two-field redux, then do the ARD redux.

pbashyal-nmdp commented 6 months ago

Reducing twice with ping mode works for the above DRB4 cases.

i.e.

ard.redux(ard.redux(allele)

Here are the DRB4s reduced twice and compared with the P version.

DRB4*01:03              lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:02           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:03           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:04           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:05           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:06           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:07           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:08           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:09           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:10           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:11           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:12           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:13           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:14           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:15           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:16           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:17           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:18           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:19           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:20           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:21           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:22           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:23           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:24           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:25           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:26           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:27           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:28           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:29           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:30           lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:31           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:32           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:33           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:34           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:35           lgx='DRB4*01:03',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:01        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:03        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:04        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:05        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:06        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:07        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:08        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:09        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:10        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:11        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:12        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:14        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:15        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:16        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:17        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:01:18        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:02:01        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
DRB4*01:03:02:02        lgx='DRB4*01:01',   lgx^2=DRB4*01:01    P: DRB4*01:01P Same: True
C*02:10:02              lgx='C*02:10',          lgx^2=C*02:02           P: C*02:02P Same: True

The C*02:10:02 appears in P group but not in G group. From hla_nom_p.txt:

9119,C*,C*02:10:02,C*02:02P,C*02:10,C*02:10:02,C*02:02

So shouldn't C*02:10:02 reduce to C*02:02 in ping mode ?

I know there were issues we had with C*02:10s previously. I don't recall the details right now.

lgragert commented 6 months ago

I suppose you could fix it by automatically applying the redux twice, as the user would not know for what alleles that would need to be done.

Yes, C*02:10:02 should reduce to C*02:02 under the lgx redux.

mmaiers-nmdp commented 6 months ago

I'll organize a meeting on this. Its a bit messy when you consider that no combination of lgx^N or ping:true/false fixes it without introducing another issue

lgragert commented 6 months ago

I've confirmed that DRB4*01:03 indeed occurs in the most recent GLID files I've analyzed and propagates into the haplotype frequency estimates, which has the consequence of inappropriately decreasing the frequency of DRB4*01:01g.

New NMDP GLID/pull files should be generated and new haplotype frequency estimates obtained after the fix is applied to pyARD.

grep "DRB4\*01:03" glid.txt
7013,DRB4*01:03+DRB4*01:03
17360,DRB4*01:01+DRB4*01:03
345530,DRB4*01:01/DRB4*01:07/DRB4*01:10/DRB4*01:11/DRB4*01:12/DRB4*01:13/DRB4*01:14N/DRB4*01:17/DRB4*01:18/DRB4*01:20/DRB4*01:21/DRB4*01:22/DRB4*01:23/DRB4*01:24/DRB4*01:25/DRB4*01:26/DRB4*01:27/DRB4*01:28/DRB4*01:29/DRB4*01:30/DRB4*01:31/DRB4*01:32/DRB4*01:33/DRB4*01:34/DRB4*01:36/DRB4*01:37/DRB4*01:38N/DRB4*01:39/DRB4*01:40/DRB4*01:41/DRB4*01:42/DRB4*01:44/DRB4*01:45/DRB4*01:47/DRB4*01:48/DRB4*01:49/DRB4*01:50/DRB4*01:51/DRB4*01:52/DRB4*01:53/DRB4*01:54/DRB4*01:55/DRB4*01:56N/DRB4*01:57N/DRB4*01:58/DRB4*01:59/DRB4*01:61N/DRB4*01:63/DRB4*01:64/DRB4*01:65N/DRB4*01:66/DRB4*01:67/DRB4*01:68/DRB4*01:69/DRB4*01:70/DRB4*01:71N/DRB4*01:73/DRB4*01:74/DRB4*01:75/DRB4*01:77/DRB4*01:78/DRB4*01:80N/DRB4*01:81/DRB4*01:82/DRB4*01:84N/DRB4*01:86/DRB4*01:87/DRB4*01:88/DRB4*01:89/DRB4*01:91/DRB4*01:92/DRB4*01:93/DRB4*01:94/DRB4*01:95/DRB4*01:96/DRB4*01:97/DRB4*01:98/DRB4*01:99/DRB4*01:100/DRB4*01:101/DRB4*01:102/DRB4*01:103/DRB4*01:104/DRB4*01:105/DRB4*01:106/DRB4*01:108N/DRB4*01:109/DRB4*01:110/DRB4*01:111/DRB4*01:112/DRB4*01:113N/DRB4*01:114/DRB4*01:115N/DRB4*01:116+DRB4*01:03
388288,DRB4*01:02+DRB4*01:03
433241,DRB4*01:01/DRB4*01:02/DRB4*01:04/DRB4*01:05/DRB4*01:07/DRB4*01:08/DRB4*01:10/DRB4*01:11/DRB4*01:12/DRB4*01:13/DRB4*01:14N/DRB4*01:15/DRB4*01:16N/DRB4*01:17/DRB4*01:18/DRB4*01:19/DRB4*01:20/DRB4*01:21/DRB4*01:22/DRB4*01:23/DRB4*01:24/DRB4*01:25/DRB4*01:26/DRB4*01:27/DRB4*01:28/DRB4*01:29/DRB4*01:30/DRB4*01:31/DRB4*01:32/DRB4*01:33/DRB4*01:34/DRB4*01:35/DRB4*01:36/DRB4*01:37/DRB4*01:38N/DRB4*01:39/DRB4*01:40/DRB4*01:41/DRB4*01:42/DRB4*01:43/DRB4*01:44/DRB4*01:45/DRB4*01:46/DRB4*01:47/DRB4*01:48/DRB4*01:49/DRB4*01:50/DRB4*01:51/DRB4*01:52/DRB4*01:53/DRB4*01:54/DRB4*01:55/DRB4*01:56N/DRB4*01:57N/DRB4*01:58/DRB4*01:59/DRB4*01:60/DRB4*01:61N/DRB4*01:62/DRB4*01:63/DRB4*01:64/DRB4*01:65N/DRB4*01:66/DRB4*01:67/DRB4*01:68/DRB4*01:69/DRB4*01:70/DRB4*01:71N/DRB4*01:72/DRB4*01:73/DRB4*01:74/DRB4*01:75/DRB4*01:77/DRB4*01:78/DRB4*01:80N/DRB4*01:81/DRB4*01:82/DRB4*01:83/DRB4*01:84N/DRB4*01:86/DRB4*01:87/DRB4*01:88/DRB4*01:89/DRB4*01:91/DRB4*01:92/DRB4*01:93/DRB4*01:94/DRB4*01:95/DRB4*01:96/DRB4*01:97/DRB4*01:98/DRB4*01:99/DRB4*01:100/DRB4*01:101/DRB4*01:102/DRB4*01:103/DRB4*01:104/DRB4*01:105/DRB4*01:106/DRB4*01:108N/DRB4*01:109/DRB4*01:110/DRB4*01:111/DRB4*01:112/DRB4*01:113N/DRB4*01:114/DRB4*01:115N/DRB4*01:116/DRB4*01:117/DRB4*01:120/DRB4*01:121N/DRB4*01:122/DRB4*01:125/DRB4*01:131/DRB4*01:133/DRB4*01:135/DRB4*01:137/DRB4*01:139/DRB4*01:140/DRB4*01:149N/DRB4*01:150/DRB4*01:154/DRB4*01:157/DRB4*01:159N/DRB4*01:160/DRB4*01:163/DRB4*01:170/DRB4*01:172+DRB4*01:03
601755,DRB4*01:03+DRB4*01:44
862578,DRB4*01:03+DRB4*01:31
1131079,DRB4*01:03+DRB4*01:05
pbashyal-nmdp commented 6 months ago

I suppose you could fix it by automatically applying the redux twice, as the user would not know for what alleles that would need to be done.

Yes, C*02:10:02 should reduce to C*02:02 under the lgx redux.

See PR #323 that does reduction twice if necessary in ping mode.

import pyard
ard = pyard.init(config={'ping': True})
ard.redux('C*02:10:02', 'lgx')
#  'C*02:02'

We can discuss the PR during the meeting.