scharch / SONAR

Software for Ontogenic aNalysis of Antibody Repertoires
GNU General Public License v3.0

ID/DIV script can't recognize certain V IDs from the germDB files #21

Open ressy opened 8 months ago

ressy commented 8 months ago

The BU_DD germline V files have a handful of entries with an "ORF_" prefix. When the ID/DIV script tries to parse a V call that starts with one of those names (with (v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+)), the pattern requires "IG" immediately after the "=", so it doesn't recognize any valid V call for the sequence and puts "unknown" and "NA" in the output table for the V call and identity.

I wouldn't expect it to matter much for sequences assigned to the ORFs anyway, except that in practice those calls are usually followed by one or more regular matches (e.g. v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01), so the effect is to exclude those sequences even though they often do have a regular V call available. Would it work to either allow characters before the "IG" in the pattern, or split on the comma and select from the resulting list? I can propose something if one of those sounds preferable.
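The two options can be sketched roughly as below. This is only an illustration under stated assumptions, not code from SONAR: the function name `first_regular_v_call`, the sample header, and the assumption that the v_call field is whitespace-delimited are all hypothetical.

```python
import re

# The pattern quoted above, reproduced for comparison (not copied from SONAR's source).
ORIGINAL = re.compile(r"(v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+)")

# Option 1 (assumed variant): allow arbitrary non-comma, non-space characters
# before "IG". Note this still reports the ORF entry as the V call.
PREFIXED = re.compile(r"(v_call|V_gene)=([^,\s]*?(IG[HKL]V[^*]+)[^,\s]+)")

# Option 2 (assumed variant): grab the whole field, split on commas, and take
# the first call that begins with a regular IG[HKL]V name.
FIELD = re.compile(r"(?:v_call|V_gene)=(\S+)")
ALLELE = re.compile(r"(IG[HKL]V[^*]+)\S*")


def first_regular_v_call(header):
    """Return (allele, gene) for the first non-prefixed V call, or None."""
    field = FIELD.search(header)
    if field is None:
        return None
    for call in field.group(1).split(","):
        match = ALLELE.fullmatch(call)  # fails for calls starting with "ORF_"
        if match:
            return match.group(0), match.group(1)
    return None


# Hypothetical header line built from the example in this comment.
header = ">seq1 v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01 j_call=IGHJ4*01"
print(ORIGINAL.search(header))            # no match: the field starts with "ORF_"
print(PREFIXED.search(header).group(2))   # the ORF call itself
print(first_regular_v_call(header))       # the regular call after the comma
```

The trade-off is that option 1 keeps the ORF assignment as the reported call, while option 2 skips it in favor of the next regular match, which seems closer to what the script needs here.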

scharch commented 7 months ago

Sorry, this fell through the cracks. I think the best answer is to remove the ORFs from the default database, which will happen when better databases are released (I think expected this year). But the regex is a bit brittle regardless. I think I have a more complicated one elsewhere (3.2 and/or 4.4) that can be copied over...

ressy commented 7 months ago

No problem. At first I wasn't even going to bother opening an issue, until I realized the non-ORF matches get missed if there's an ORF one in front. It also doesn't come up often for us (and even less so now that I'm generally using KIMDB for rhesus heavy chain, though a more recent database for all loci would be great).