psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
Converting partis partition yaml to AIRR tsv #304

Closed alexpan82 closed 4 years ago

alexpan82 commented 4 years ago

To whom it may concern,

Is it possible to have both the yaml and airr tsv as simultaneous outputs of partition? I have tried to use view-output to convert the yaml to airr format with no success.

Thank you! Alex

psathyrella commented 4 years ago

Hmm, that's a good point, especially since I think there's a ton of info in the partis yaml that isn't in airr format. It would mostly just involve removing this return, but I'll also have to have it accept a .yaml as output file suffix when --airr-output is specified and just write the airr output to <--outfname>.tsv. I'll fix and update this.

psathyrella commented 4 years ago

ok! lmk if that isn't clear or doesn't work, but I think it's doing the right thing.

https://github.com/psathyrella/partis/blob/master/bin/partis#L381

alexpan82 commented 4 years ago

Thank you for your quick response! It works as expected :)

With regards to your previous reply, I think that providing column(s) in the airr format that specify somatic hyper mutation info (IGHV shm, number of substitutions, etc) instead of just the cigar string can go a long way in bridging the "gap" b/t yaml and airr.

psathyrella commented 4 years ago

oh, well it's super easy to add whatever extra columns you'd like. I think I just added the "mandatory" ones, but it's just a list: https://github.com/psathyrella/partis/blob/dev/python/utils.py#L724. And I couldn't possibly agree more that the cigar format is horrible, I've sunk far too much of my life into writing code that can reliably parse that craziness. Just lmk which columns I should add: https://docs.airr-community.org/en/stable/datarep/rearrangements.html#fields.

alexpan82 commented 4 years ago

I very much appreciate your willingness to help!

I work with leukemia data; as such, my parameters of interest would be *_identity (specifically v_identity), c-call (I understand this is out of scope for now), and *_start/*_end (cdr[123] and fwr[1234])

I empathize with your pain regarding writing parsers for these complex outputs

psathyrella commented 4 years ago

I added the [vdj]_identity and [vdj]_sequence_{start,end} and cdr3_{start,end} here.
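
For anyone reading the resulting file, here is a minimal sketch of pulling these columns out of an AIRR tsv with the Python standard library. The column names are the ones mentioned in this thread plus the standard AIRR `sequence_id`; the function name `read_airr_rows` and the sample row are made up for illustration, and which columns a given partis version actually writes is an assumption to check against your own output.

```python
import csv
import io

# Numeric columns discussed in this thread, with the type each is parsed as.
# Whether a given partis version writes all of them is an assumption.
NUMERIC_COLUMNS = {
    "v_identity": float,
    "v_sequence_start": int,
    "v_sequence_end": int,
    "cdr3_start": int,
    "cdr3_end": int,
}

def read_airr_rows(handle):
    """Parse a tab-separated AIRR file into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for line in reader:
        row = {"sequence_id": line.get("sequence_id")}
        for name, cast in NUMERIC_COLUMNS.items():
            value = line.get(name)
            # Missing columns and empty cells both become None.
            row[name] = cast(value) if value not in (None, "") else None
        rows.append(row)
    return rows

# Invented sample data, only to show the expected layout.
SAMPLE = (
    "sequence_id\tv_identity\tv_sequence_start\tv_sequence_end\tcdr3_start\tcdr3_end\n"
    "seq1\t0.95\t0\t294\t288\t330\n"
    "seq2\t\t0\t290\t284\t320\n"
)

rows = read_airr_rows(io.StringIO(SAMPLE))
```

With a real file, pass an open file handle instead of the `io.StringIO` wrapper.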

Unfortunately I can't really add cdr[12] and fwr[1234] since we don't store this info in our current germline info format. At the time this was a purposeful decision since the cdr3 is the only one that has a really unambiguous definition, but at this point I wish I'd just thrown in the other ones too since in practice everyone wants the imgt ones, even if in principle they aren't necessarily that consistent. In any case, they should be straightforward to get from the v_sequence_start, v_5p_del, and a fresh download of the imgt germline alignments (keeping in mind of course partis is all 0-indexed).
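
A rough sketch of that coordinate arithmetic is below; it is not partis code. It assumes the IMGT-gapped V germline is one string with `.` (or `-`) as gap characters, uses the IMGT unique numbering boundaries for the V region (FR1 1-26, CDR1 27-38, FR2 39-55, CDR2 56-65, FR3 66-104), treats `v_start` as the 0-indexed position in the query where the V match begins and `v_5p_del` as the number of germline bases deleted from the 5' end, and ignores indels and 3' deletions. The function name and the toy germline are invented for illustration.

```python
# IMGT unique numbering boundaries for the V region (codon positions, 1-based, inclusive).
IMGT_REGIONS = {
    "fwr1": (1, 26),
    "cdr1": (27, 38),
    "fwr2": (39, 55),
    "cdr2": (56, 65),
    "fwr3": (66, 104),
}

def region_bounds_in_query(gapped_germline, v_start, v_5p_del, gap_chars=".-"):
    """Return {region: (start, end)} in 0-indexed, half-open query coordinates.

    Hypothetical helper: assumes no indels between query and germline and
    does not clip at the 3' end of the V match.
    """
    # ungapped_before[i] = number of real germline bases left of alignment column i
    ungapped_before = []
    n_bases = 0
    for ch in gapped_germline:
        ungapped_before.append(n_bases)
        if ch not in gap_chars:
            n_bases += 1
    ungapped_before.append(n_bases)  # entry for the column just past the end

    bounds = {}
    for region, (first, last) in IMGT_REGIONS.items():
        col_start = 3 * (first - 1)                      # first alignment column of the region
        col_end = min(3 * last, len(gapped_germline))    # one past its last column
        if col_start >= len(gapped_germline):
            continue                                     # germline string ends before this region
        germ_start = ungapped_before[col_start]
        germ_end = ungapped_before[col_end]
        # Shift from germline to query coordinates; bases removed by the 5' deletion clip to v_start.
        q_start = v_start + max(germ_start - v_5p_del, 0)
        q_end = v_start + max(germ_end - v_5p_del, 0)
        if q_end > q_start:
            bounds[region] = (q_start, q_end)
    return bounds

# Toy gapped germline: 104 codons, with gaps inside cdr1 and cdr2 (invented, not a real gene).
TOY_GERMLINE = "A" * 78 + "C" * 24 + "." * 12 + "G" * 51 + "T" * 24 + "." * 6 + "A" * 117

full = region_bounds_in_query(TOY_GERMLINE, v_start=0, v_5p_del=0)
shifted = region_bounds_in_query(TOY_GERMLINE, v_start=5, v_5p_del=3)
```

AIRR coordinates are 1-based and closed, so a half-open 0-indexed pair `(start, end)` from this sketch would correspond to `start + 1` and `end` in AIRR terms.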

I would've also added a count of the total mutations in the sequence (or full-sequence identity), since I use that a lot, and the v/d/j_identity columns don't include the non-templated regions so they're incomplete, but the airr definitions don't seem to have a column for that. I guess maybe because they only think in terms of single-sequence annotations (not full-family annotations) so they have no way of measuring mutations in non-templated regions.
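
As an illustration of the full-sequence count described above, here is a minimal sketch that compares an observed sequence with an inferred naive sequence of the same length, position by position. It assumes both strings are already aligned (no indels) and that ambiguous `N` positions should be skipped; the function name is hypothetical and this is not how partis computes its own mutation counts.

```python
def count_mutations(observed, naive, ambiguous="N"):
    """Return (n_mutations, identity) over all positions, including non-templated ones.

    Hypothetical helper: both sequences must have the same length.
    identity is None when no position could be compared.
    """
    if len(observed) != len(naive):
        raise ValueError("observed and naive sequences must have the same length")
    n_mutations = 0
    n_compared = 0
    for obs_base, naive_base in zip(observed.upper(), naive.upper()):
        if obs_base in ambiguous or naive_base in ambiguous:
            continue  # skip positions where either base is ambiguous
        n_compared += 1
        if obs_base != naive_base:
            n_mutations += 1
    identity = 1 - n_mutations / n_compared if n_compared else None
    return n_mutations, identity

# Invented example: two mismatches over seven comparable positions.
n_mut, identity = count_mutations("ACGTNACG", "ACGAAACT")
```

Because the comparison runs over the whole sequence, mutations in the non-templated insertions are counted as well, which is the part the per-gene identity columns leave out.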

alexpan82 commented 4 years ago

Thank you so much for all your updates. These are all great changes that have significantly helped me. Looking forward to looking at our analyzed data!