Gene name conversion drops sequences; we don't get quite as many sequences as we ask for

matsengrp / vampire

🧛 Deep generative models for TCR sequences 🧛

Closed by matsen 5 years ago

matsen commented 5 years ago

When we generate sequences using OLGA, we ask for nseqs sequences. After gene name conversion we get about 10% fewer than that.

matsen commented 5 years ago

More importantly, our comparisons should account for the fact that OLGA can't emit certain sequences because of the gene name conversion.

matsen commented 5 years ago

So we should probably just toss any sequences, from any program or data set, that aren't in the intersection of both sets.
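A minimal sketch of that intersection filter, for illustration only. It assumes "both sets" means the sets of gene names seen in each collection, and that each sequence is a dict with hypothetical `v_gene` and `j_gene` keys; the function name and the example rows are made up here and are not taken from the vampire code.

```python
def restrict_to_shared_genes(a, b, gene_keys=("v_gene", "j_gene")):
    """Keep only the rows of a and b whose genes occur in both collections.

    The key names in gene_keys are hypothetical; adapt them to the real columns.
    """
    # For each gene column, the set of names present in both collections.
    shared = {
        key: {row[key] for row in a} & {row[key] for row in b}
        for key in gene_keys
    }

    def keep(row):
        # A row survives only if every one of its genes is shared.
        return all(row[key] in shared[key] for key in gene_keys)

    return [r for r in a if keep(r)], [r for r in b if keep(r)]


# Made-up example rows standing in for generated and observed sequences.
olga = [
    {"cdr3": "CASSLGF", "v_gene": "TRBV5-1", "j_gene": "TRBJ2-1"},
    {"cdr3": "CASSQETQYF", "v_gene": "TRBV7-9", "j_gene": "TRBJ2-5"},
]
data = [
    {"cdr3": "CASSLAPGATNEKLFF", "v_gene": "TRBV5-1", "j_gene": "TRBJ2-1"},
    {"cdr3": "CASSPGQGYEQYF", "v_gene": "TRBV12-3", "j_gene": "TRBJ2-1"},
]

olga_kept, data_kept = restrict_to_shared_genes(olga, data)
```

Filtering both sides with the same shared sets keeps the comparison symmetric: neither source is scored on sequences the other could never have produced.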

matsen commented 5 years ago

We are restricting gene usage in the preprocessing script, and generating sequences using Ppost rejection sampling, so this is no longer a problem.