psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Allele finding continues to find more and more alleles #210

Closed krdav closed 7 years ago

krdav commented 8 years ago

I tried the new allele finding feature. The concept is really cool and, better still, very important, so props for taking it up. My experience running it was not that good, though. I ran it on a bunch of sequences from a single animal with a fairly well-defined germline gene set, where all the germlines should be in the IMGT set and therefore already in partis. So I expected the allele finding feature to find only a very limited set of new alleles that I might have missed.

Instead it found dozens, then restarted allele finding only to find more. In short, it looked like overfitting on the SHMs.

Can you elaborate a little on how this algorithm works and what mechanisms are in place to avoid overfitting on the SHMs? Also, if you would like to test it yourself, let me know and I will prepare a dataset for you.

psathyrella commented 8 years ago

yeah, sorry, writing up a good description of what it's doing has been lurking on my todo list for too long, I'll do that today. EDIT um, that may be optimistic. this week?

It's supposed to keep running repeatedly until it doesn't find any new alleles -- this is how it deals with repertoires that have more than one new allele for a single gene. But two things for your case
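
For illustration only, the repeat-until-nothing-new behaviour described in this comment can be sketched roughly as below. This is not partis code: `iterate_allele_finding`, `find_new_alleles`, `toy_finder`, and the allele names are all hypothetical, and the `max_iterations` cap is an assumption added here as one possible guard against a finder that never converges.

```python
def iterate_allele_finding(germline_set, find_new_alleles, max_iterations=10):
    """Re-run allele finding until a pass reports nothing new.

    find_new_alleles is a hypothetical callable (not a partis function):
    it takes the current set of germline names and returns candidate new ones.
    max_iterations is an assumed safety cap, not something the thread specifies.
    """
    germlines = set(germline_set)
    n_passes = 0
    while n_passes < max_iterations:
        n_passes += 1
        # keep only candidates that are not already in the germline set
        new = set(find_new_alleles(germlines)) - germlines
        if not new:
            break  # converged: this pass found nothing new
        germlines |= new
    return germlines, n_passes


def toy_finder(hidden):
    """Toy stand-in that reveals at most one unknown allele per gene per pass."""
    def find(germlines):
        found = set()
        for alleles in hidden.values():
            missing = sorted(a for a in alleles if a not in germlines)
            if missing:
                found.add(missing[0])
        return found
    return find


if __name__ == "__main__":
    # one gene with two made-up new alleles, so more than one pass is needed
    hidden = {"IGHV1-2": ["IGHV1-2*x1", "IGHV1-2*x2"]}
    final, n_passes = iterate_allele_finding({"IGHV1-2*01"}, toy_finder(hidden))
    print(sorted(final), n_passes)
```

With a finder that reveals one allele per gene per pass, a gene carrying two new alleles needs two productive passes plus a final empty pass to confirm convergence. A finder that keeps inventing alleles from SHM noise would never hit the empty pass, which is the runaway behaviour reported in this issue; the cap only bounds it and does not fix the overfitting.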

krdav commented 8 years ago

Don't rush it on my account, I don't need it right now. You have the data and their associated germlines.