Closed: ssnn-airr closed this issue 8 years ago
Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).
I have checked these results by eye, and they seem to match up. Using the --cloned flag has interesting results: whichever sequence is selected to represent the clone, the N/P counts for that sequence are used to make the germline for the entire clone. This sequence does not always have the consensus N/P count for the clone; it is selected (if I recall correctly) for having the longest V and J sequences and the consensus V/J gene calls. I think this looks good enough to close the issue.
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
If this is for the REGIONS_PN log entry, we won't be able to make the N/P fields required by CreateGermlines, as we can't get them from IgBLAST (so requiring them would break CreateGermlines). I'm guessing the cleanest approach would be to generate the REGIONS log differently if these additional fields are found in the input.
Original comment by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).
What about creating a --regions subcommand to add the field REGIONS to the final db file? NP1_LENGTH and NP2_LENGTH would be required fields. If N1_LENGTH, P3V_LENGTH,... and the others are found (IMGT style), get REGIONS coded as VPNPDPNPJ, otherwise (IgBLAST style), use the VNDNJ schema.
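To make the proposal above concrete, here is a minimal Python sketch of how such a REGIONS string could be assembled from length fields. It is not the CreateGermlines implementation. NP1_LENGTH, NP2_LENGTH, N1_LENGTH, P3V_LENGTH, D_SEQ_LENGTH and J_SEQ_LENGTH are taken from this thread; V_SEQ_LENGTH, P5D_LENGTH, P3D_LENGTH, N2_LENGTH and P5J_LENGTH, the function name, and the idea of one letter per nucleotide are assumptions made for illustration.

```python
# Each schema is an ordered list of (region letter, length field).
# Field names not mentioned in the thread are assumptions (see lead-in).
IGBLAST_SCHEMA = [
    ("V", "V_SEQ_LENGTH"), ("N", "NP1_LENGTH"), ("D", "D_SEQ_LENGTH"),
    ("N", "NP2_LENGTH"), ("J", "J_SEQ_LENGTH"),
]
IMGT_SCHEMA = [
    ("V", "V_SEQ_LENGTH"), ("P", "P3V_LENGTH"), ("N", "N1_LENGTH"),
    ("P", "P5D_LENGTH"), ("D", "D_SEQ_LENGTH"), ("P", "P3D_LENGTH"),
    ("N", "N2_LENGTH"), ("P", "P5J_LENGTH"), ("J", "J_SEQ_LENGTH"),
]


def _has_value(record, field):
    """True if the field is present and not empty."""
    return record.get(field) not in (None, "")


def regions_string(record):
    """Build a per-nucleotide region string for one database row (a dict).

    Uses the IMGT-style schema when every IMGT length field is present,
    otherwise falls back to the IgBLAST-style V/N/D/N/J schema.
    """
    use_imgt = all(_has_value(record, field) for _, field in IMGT_SCHEMA)
    schema = IMGT_SCHEMA if use_imgt else IGBLAST_SCHEMA
    # Repeat each region letter once per nucleotide in that region.
    return "".join(letter * int(record[field]) for letter, field in schema)
```

With only the IgBLAST-style fields, a row with lengths 3, 2, 2, 1, 2 would give a string of three V, two N, two D, one N and two J characters; zero-length regions simply contribute nothing.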
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
I'm not seeing the benefit. Presumably, the purpose would be so you can parse the 'VNDNJ' string to get the start/length of each region, but you need that info to create the string in the first place. Seems cleaner to just use the start/length fields directly in whatever application needs them. But maybe I'm missing something? Is there another use?
Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).
Is there a use to having the start and end other than to make the string? I feel like I want the string to use for stuff more so than the positions. Or if anything, have both.
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
I thought this began from a desire to analyze the N/P sequences separately (length, amino acids, etc), so they needed the positions/lengths to pull out the nucleotides from the input sequence. It would also be useful for doing VH replacement footprint searches.
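As a rough illustration of pulling the N/P nucleotides out by length, the Python sketch below slices an ungapped input sequence into consecutive regions. The function name and argument layout are hypothetical, and it assumes the sequence starts at the first V nucleotide and that the lengths are given in V, NP1, D, NP2, J order.

```python
def split_regions(sequence, lengths):
    """Slice an ungapped sequence into consecutive named regions.

    `lengths` is an ordered list of (region name, length) pairs, for
    example V, NP1, D, NP2, J. Returns a dict of region name -> nucleotides.
    """
    regions = {}
    start = 0
    for name, length in lengths:
        end = start + int(length)
        regions[name] = sequence[start:end]  # empty string if length is 0
        start = end
    return regions


# Toy example with made-up lengths (not real data).
parts = split_regions(
    "AAAAACCGGGTTTT",
    [("V", 5), ("NP1", 2), ("D", 3), ("NP2", 1), ("J", 3)],
)
```

The NP1 and NP2 entries of the result can then be analyzed on their own (length, translation, and so on) without going through a region string.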
But if it's just for the sake of visualization, then I think having it in the log alone is sufficient. However, I don't see a downside to adding a regions option to the -g flag to make a GERMLINE_REGIONS field (or whatever we want to call it). I mean, it's another thing we'd need to maintain which seems to have limited use, but if we are already putting it in the log it's not really any more effort.
Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).
I agree it may make more sense to add the regions flag to CreateGermlines. Right now I really do want to know what region each nt of the gapped sequence is in.
Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).
@namita1025 I think something like:
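# Cumulative region ends for row 1; 312 is the hard-coded V length (IMGT positions before the CDR3)
v <- cumsum(c(312, df$NP1_LENGTH[1], df$D_SEQ_LENGTH[1], df$NP2_LENGTH[1], df$J_SEQ_LENGTH[1]))
# Label each position of the gapped sequence with its region (NA past the end of J)
c("V", "NP1", "D", "NP2", "J")[findInterval(seq_len(nchar(df$SEQUENCE_IMGT[1])), c(0, v), left.open=TRUE)]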
Would do that. (Syntax totally made up - I'm sure it needs fixing.)
Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).
No, I think my whole issue is that I want to know which nucleotides in the CDR3 belong to the V, so hard-coding 312 doesn't solve my problem.
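One possible way to get at this, sketched in Python under assumptions: if the database carries the start and length of the V alignment in IMGT-gapped coordinates, the V nucleotides that extend into the CDR3 are whatever the V alignment covers past IMGT position 312. The argument names below are hypothetical stand-ins for such fields, and the sketch assumes no gap characters fall inside that stretch.

```python
# 0-based index of IMGT position 313, the first CDR3 nucleotide.
CDR3_START = 312


def v_nucleotides_in_cdr3(sequence_imgt, v_germ_start_imgt, v_germ_length_imgt):
    """Return the CDR3 nucleotides still covered by the V alignment.

    `v_germ_start_imgt` is the 1-based start of the V alignment in the
    gapped sequence and `v_germ_length_imgt` its gapped length; both are
    assumed inputs, not fields confirmed by this thread.
    """
    # 0-based exclusive end of the V alignment in the gapped sequence.
    v_end = int(v_germ_start_imgt) - 1 + int(v_germ_length_imgt)
    # Empty when the V alignment stops before the CDR3 begins.
    return sequence_imgt[CDR3_START:max(CDR3_START, v_end)]
```

For example, a V alignment starting at position 1 with gapped length 318 would assign the first six CDR3 nucleotides to the V, rather than assuming the V always ends at 312.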
Original report by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).