Sequence names changed? - Githubissues

matsengrp / cft

Clonal family tree

5 stars 3 forks source link

Sequence names changed? #168

Closed lauradoepker closed 7 years ago

lauradoepker commented 7 years ago

Just looked at 5555 for the first time in a while and I noticed all the sequence (leaf) names changed to some alphanumeric gibberish? They aren't their old names, which poses a problem for me at the moment because I'm trying to track the sequences from older trees that were used in Brianna's experiments. Can you remind me where to find these old trees? Bri used the parsimony trees for QB850 in mid-late-January and early-February. @metasoarous @WSDeWitt

lauradoepker commented 7 years ago

JK. Found them on stoat:5002. Can we make sure this site remains up? How do I ensure that I can always find this version of the data?

psathyrella commented 7 years ago

@lauranoges this is a result of how we implemented multiple time points: the names vlad assigns are just <order in file>-<multiplicity>, so the same sequence name can occur in both time points. I think the reason we didn't just append the dataset name to the original name was that some program that cft uses has a bug [editorializing here, sorry...] where it can't handle/truncates long names. Our solution was renaming the sequences in the merged data set, and keeping track of the correspondence between new and old names in a separate file. These translation files are here: /fh/fast/matsen_e/data/kate-qrs/vlad-processed-data-with-seed-seqs/

While the idea was of course that we'd do it for you automatically, in the meantime if you want to check something quickly they're just csvs like:

dset,new,original,timepoint
Hs-LN4-5RACE-IgG,000000vlh5,826318-1,1586dpi

metasoarous commented 7 years ago

@lauranoges - Yes; @psathyrella is correct. Take a look at #10 for reference. I think you were vacationing during that time, so we never got feedback from you about whether our approach would be a problem or not.

@psathyrella I'd say this problem is somewhat deeper than just "a bug"... But yes, this has to do with some asinine behavior in dnaml/dnapars. However, I've since changed the way this is getting handled in our processing of the dnaml/dnapars output (to fix #138), so we could actually switch to just prepending Vlad's sequence names with timepoint numbers, as long as they're short; The fix I did for #138 only works if the truncated sequence names are unique within every cluster, and sequence names are truncated to 10 characters. So prefixing with L1-/L3- is probably fine, but 365dpi- would probably not be. Good news is it's set to error if there are uniqueness conflicts after this pruning, and as long as there aren't, the final output will reflect the full, untruncated ids. For the record, the translation file you created is still useful, so please keep that around.

psathyrella commented 7 years ago

hm, 10 characters? I don't know that we can work with a hard limit, because the vlad-names are unbounded in length -- in practice it's unlikely/impossible, since they're sorted by decreasing abundance, but if a sequence toward the end of a file had high abundance the original name could be very long, so we wouldn't even know how many characters to add for the data set name.

If we're comfortable ignoring that possibility, the existing data set shorthands are frequently only two characters: https://github.com/psathyrella/datascripts/blob/master/meta/kate-qrs/meta.csv

metasoarous commented 7 years ago

The longest name I saw in the kate-qrs dataset when I looked at this was 8 characters in length, and everything that long had -1 at the end. So as far as uniqueness goes in this dataset, the last two characters are superfluous. Also, uniqueness would only need to hold per cluster, and only per the 100-300 exemplars downsampled from each cluster. I think if we just prefix with the timepoint/sample shorthand number, and ignore the i/j/k, we should be okay for most bigger datasets we'll see. If a clash ever does come up, we can always wrap dnaml/dnapars with some steps which temporarily replace the full sequence names with shorter identifiers. In fact, we'll likely have to do this in order to generalize cft (#166). For now though, I say we just rock the numeric prefixes.

psathyrella commented 7 years ago

yeah, but the last two characters are where we're getting the multiplicity (right? I mean I'm not using it and if I was I could get it using the translation file, but I thought someone was using it).

I'm reluctant to set up a system that requires checking to see how long the uids are in each sample before deciding what the data set shorthand is going to be, and in this case requires creating a second shorthand for each data set... how about I use the exiting shorthand. It sounds like this will work, just barely?

metasoarous commented 7 years ago

Yes; the last two characters are where we get the multiplicity, but again, we can swap those characters back in as long as the names are unique per downsampled cluster. So we're not going to loose anything in the final output (/UI). And I'll likely extract that data as metadata anyway before we even do the downsampling, so this shouldn't be a concern.

I didn't mean that you should check how long the ids are before deciding on the shorthand. I was suggesting that you just prefix with the dataset's shorthand number across the board, like 1-234334-3. But if you'd rather leave in the i/j/k, that fine with me; In fact, then we don't have to use the - character, since the letter will make this unnecessary (1k234334-3 still parses unambiguously). Even if we do include the - though, you're right, statistically, we'll probably be fine, and if we're not, I'll figure out how to handle this issue more generally on my end. So again, up to you.

psathyrella commented 7 years ago

ok, I will change the data set merging code in datascripts to prepend the data set shorthand, also defined in datascripts, with a dash, to the existing sequence id. Does this involve also immediate re-running, or are we deciding how we'll want it for the next iteration?

metasoarous commented 7 years ago

@lauranoges Your call on whether this should get an immediate re-run or not.

lauradoepker commented 7 years ago

@psathyrella if the next iteration is sometime this week or next, let's do that. If it's farther in the future, please re-run at your convenience.

Thanks -- I'm not sure I followed 100% but I presume the names will go back to looking something like Vlad's names that we were using before.

metasoarous commented 7 years ago

@psathyrella Have you rerun this yet?

psathyrella commented 7 years ago

uh, no! hey, I should do that, sorry

metasoarous commented 7 years ago

@psathyrella Perhaps we could try to time this for our milestone release (June 1st)? If you're ok with them, I may have some unobtrusive additions to datascripts that might help us with #167, and it might be easier to rerun after.

psathyrella commented 7 years ago

ok, I will wait til you tell me to go.

metasoarous commented 7 years ago

OK @psathyrella; We can run whenever you're ready.

I decided not to add any additional output to datascripts at this stage. I may still want to in the future, but for now have a tooling solution to the issue I was trying to solve. So again, please run at your convenience.

psathyrella commented 7 years ago

ok

metasoarous commented 7 years ago

Now that we have a temporary deployment of the new data up, I'm going to close this.