immcantation / changeo

Change-O is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). Change-O provides tools for processing the output of V(D)J alignment tools, assigning clonal clusters, and reconstructing germline sequences.

https://changeo.readthedocs.io

GNU Affero General Public License v3.0

Update required fields in CreateGermlines #69

Closed ssnn-airr closed 8 years ago

ssnn-airr commented 8 years ago

Original report by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).

ssnn-airr commented 8 years ago

Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).

I have checked these results by eye, and they seem to match up. Using the --cloned flag has interesting results: whichever sequence is selected to represent the clone, the N/P counts for that sequence are used to make the germline for the entire clone. This sequence does not always have the consensus N/P count for the clone, but is selected (if I recall correctly) for having the longest V and J sequences and the consensus V/J gene calls. I think this looks good enough to close the issue.

ssnn-airr commented 8 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

If this is for the REGIONS_PN log entry, we won't be able to make the N/P fields required by CreateGermlines, as we can't get them from IgBLAST (so requiring them would break CreateGermlines). I'm guessing that the cleanest approach would be to have the REGIONS log generated differently if these additional fields are found in the input.

ssnn-airr commented 8 years ago

Original comment by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).

What about creating a --regions subcommand to add the field REGIONS to the final db file? NP1_LENGTH and NP2_LENGTH would be required fields. If N1_LENGTH, P3V_LENGTH,... and the others are found (IMGT style), get REGIONS coded as VPNPDPNPJ, otherwise (IgBLAST style), use the VNDNJ schema.
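
A minimal Python sketch of the coding proposed above, for illustration only and not Change-O code. The `regions_string` helper and the `V_LENGTH`, `D_LENGTH` and `J_LENGTH` keys are hypothetical. The `P5D_LENGTH`, `P3D_LENGTH`, `N2_LENGTH` and `P5J_LENGTH` names are an assumption about what "the others" might be. Each nucleotide gets one letter, so a record with the IMGT-style fields yields a VPNPDPNPJ pattern and any other record falls back to VNDNJ.

```python
# Hypothetical helper: builds a per-nucleotide REGIONS string from segment lengths.
# All field names except NP1_LENGTH, NP2_LENGTH, N1_LENGTH and P3V_LENGTH are assumptions.
IMGT_FIELDS = ["P3V_LENGTH", "N1_LENGTH", "P5D_LENGTH",
               "P3D_LENGTH", "N2_LENGTH", "P5J_LENGTH"]

IMGT_LAYOUT = [("V", "V_LENGTH"), ("P", "P3V_LENGTH"), ("N", "N1_LENGTH"),
               ("P", "P5D_LENGTH"), ("D", "D_LENGTH"), ("P", "P3D_LENGTH"),
               ("N", "N2_LENGTH"), ("P", "P5J_LENGTH"), ("J", "J_LENGTH")]

IGBLAST_LAYOUT = [("V", "V_LENGTH"), ("N", "NP1_LENGTH"), ("D", "D_LENGTH"),
                  ("N", "NP2_LENGTH"), ("J", "J_LENGTH")]


def regions_string(row):
    """Return one region letter per nucleotide for a record given as a dict."""
    # Use the detailed layout only when every IMGT-style field is present.
    has_imgt = all(field in row for field in IMGT_FIELDS)
    layout = IMGT_LAYOUT if has_imgt else IGBLAST_LAYOUT
    return "".join(code * int(row[field]) for code, field in layout)


igblast_row = {"V_LENGTH": 3, "NP1_LENGTH": 2, "D_LENGTH": 2,
               "NP2_LENGTH": 1, "J_LENGTH": 2}
imgt_row = dict(igblast_row, P3V_LENGTH=1, N1_LENGTH=1, P5D_LENGTH=0,
                P3D_LENGTH=0, N2_LENGTH=1, P5J_LENGTH=0)

print(regions_string(igblast_row))  # VNDNJ schema
print(regions_string(imgt_row))     # VPNPDPNPJ schema
```

Zero-length segments simply contribute no letters, so the same function covers records with no P nucleotides.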

ssnn-airr commented 8 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

I'm not seeing the benefit. Presumably, the purpose would be so you can parse the 'VNDNJ' string to get the start/length of each region, but you need that info to create the string in the first place. Seems cleaner to just use the start/length fields directly in whatever application needs them. But maybe I'm missing something? Is there another use?

ssnn-airr commented 8 years ago

Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).

Is there a use to having the start and end other than to make the string? I feel like I want the string to use for stuff more so than the positions. Or if anything, have both.

ssnn-airr commented 8 years ago

Original comment by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).

I think the use is just to have something that can help visualize the different regions in the context of the sequence and the germline.

ssnn-airr commented 8 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

I thought this began from a desire to analyze the N/P sequences separately (length, amino acids, etc), so they needed the positions/lengths to pull out the nucleotides from the input sequence. It would also be useful for doing VH replacement footprint searches.

But if it's just for the sake of visualization, then I think having it in the log alone is sufficient. However, I don't see a downside to adding a regions option to the -g flag to make a GERMLINE_REGIONS field (or whatever we want to call it). I mean, it's another thing we'd need to maintain which seems to have limited use, but if we are already putting it in the log it's not really any more effort.
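
For the separate N/P analysis mentioned above, a rough Python sketch of pulling the junction nucleotides out using lengths alone. It assumes an ungapped sequence that starts at the first V nucleotide and that the segments are contiguous in V, NP1, D, NP2, J order. The `np_sequences` helper and its argument names are hypothetical, not part of Change-O.

```python
def np_sequences(sequence, v_len, np1_len, d_len, np2_len):
    """Return the (NP1, NP2) nucleotide strings from an ungapped V(D)J sequence.

    Assumes `sequence` starts at the first V nucleotide and that the
    segments follow each other with no gaps between them.
    """
    np1_start = v_len                      # NP1 begins right after V
    np2_start = v_len + np1_len + d_len    # NP2 begins right after D
    np1 = sequence[np1_start:np1_start + np1_len]
    np2 = sequence[np2_start:np2_start + np2_len]
    return np1, np2


# Toy record: 4 nt of V, 2 nt of NP1, 3 nt of D, 1 nt of NP2, 5 nt of J.
toy = "AAAA" + "GG" + "TTT" + "C" + "AAAAA"
np1, np2 = np_sequences(toy, v_len=4, np1_len=2, d_len=3, np2_len=1)
print(np1, np2)
```

With the lengths already in the db file, length or amino acid summaries of the N/P regions only need these two slices.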

ssnn-airr commented 8 years ago

Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).

I agree it may make more sense to add the regions flag to CreateGermlines. Right now I really do want to know what region each nt of the gapped sequence is in.

ssnn-airr commented 8 years ago

Original comment by ssnn (Bitbucket: ssnn, GitHub: ssnn-airr).

Ok. The flag, then. If someone needs the info, use the flag, otherwise, don't clutter the db file.

ssnn-airr commented 8 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

@namita1025 I think something like:

v <- cumsum(c(312, df$NP1_LENGTH, df$D_SEQ_LENGTH, df$NP2_LENGTH, df$J_SEQ_LENGTH))
cut(s2c(df$SEQUENCE_IMGT[1]), breaks=v[1])

Would do that. (Syntax totally made up - I'm sure it needs fixing.)

ssnn-airr commented 8 years ago

Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).

No, I think my whole issue is that I want to know which nucleotides in the CDR3 belong to the V....so hard-coding 312 doesn't solve my problem.

ssnn-airr commented 8 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

312 is the start of the CDR3, so you probably need 312 to (V_GERM_LENGTH - 312). I missed some bits in the syntax above, but all the info is already in the db files.
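
A hedged Python sketch of that idea, assuming the V germline length is counted in IMGT-gapped coordinates and taking 312 from the comment above as the 0-based offset where the CDR3 begins. The helper names are hypothetical, and the real column names in a db file may differ from the arguments used here.

```python
CDR3_START = 312  # offset taken from the thread; treated here as a 0-based index (assumption)


def region_labels(v_len, np1_len, d_len, np2_len, j_len):
    """Return one region label per nucleotide of the gapped sequence."""
    names = ["V", "NP1", "D", "NP2", "J"]
    lengths = [v_len, np1_len, d_len, np2_len, j_len]
    labels = []
    for name, length in zip(names, lengths):
        labels.extend([name] * length)
    return labels


def v_bases_in_cdr3(sequence_imgt, v_len, cdr3_start=CDR3_START):
    """Return the CDR3 nucleotides that are still part of the V segment."""
    # Empty when the V segment ends before the CDR3 starts.
    return sequence_imgt[cdr3_start:v_len]


# Toy gapped sequence: 318 nt of V (6 of them inside the CDR3), then NP1, D, NP2, J.
toy_imgt = "A" * 312 + "TGTGCG" + "GG" + "TAC" + "C" + "TTTGAC"
labels = region_labels(v_len=318, np1_len=2, d_len=3, np2_len=1, j_len=6)

print(v_bases_in_cdr3(toy_imgt, v_len=318))
print(labels[CDR3_START:])
```

The number of V nucleotides inside the CDR3 is then the V length minus 312, and the label list answers which region each position of the gapped sequence belongs to.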

ssnn-airr commented 8 years ago

Original comment by Namita Gupta (Bitbucket: namita1025, GitHub: namita1025).

Donezo