matsengrp / cft

Clonal family tree

Immortalize these trees for publication #171

Closed: lauradoepker closed this issue 7 years ago

lauradoepker commented 7 years ago

@metasoarous: We just got some GREAT data using some of the laura-mb trees. We have recorded all of the information (commits, etc.) from the webpages, but we'd like to download the dataset so we can be sure we can recreate the trees for future publications.

The trees were accessed on March 30, 2017. Here are the webpage snapshots:

[Screenshot: 2017-05-01 at 11:58:20 AM]

[Screenshot: 2017-05-01 at 11:59:15 AM]

lauradoepker commented 7 years ago

Any progress on saving the database and other methods information for these trees? The data look amazing :)

metasoarous commented 7 years ago

Oops... typed this response last week but never sent.

Can you see /home/matsengrp/working/csmall/cft/output/builds? That's where the datasets are saved (all the data needed to run them). I haven't bundled the data/cftweb pairings together yet, but the versions are there in your snapshot, so it shouldn't be too difficult to get back there.

lauradoepker commented 7 years ago

Yes, I have access, but I do not see a dataset or even a directory for 2017-03-22, which is referenced in the snapshots above. More importantly, are we able to reconstruct these trees on a server? We need them for the project, since our antibody choices were based on them. Let me know if/how I can help.

metasoarous commented 7 years ago

Ah... I didn't realize this was a snapshot that you had saved some time back. It isn't from another instance running somewhere, is it?

Unfortunately, I don't have that data build anymore :-/ I can try to go back to that exact commit and rerun things if you like. Assuming the input data haven't changed, we should get the same results. How should I prioritize doing this vs. our June milestone?

lauradoepker commented 7 years ago

Please prioritize this ahead of the June milestone - it's critical that we nail down the methods for these (AWESOME) trees. Given that you don't have the data build anymore, we need to see ASAP whether or not we get exactly the same trees as before. Yay publishing!

metasoarous commented 7 years ago

I've made some progress on this. Check out http://stoat:5554. There you can see the dnaml data built from the git commit referenced in your snapshots. It should look like what you saw before; if anything is different, please let me know. As for the dnapars build, I'm not sure why it isn't loading at present; I'll look into that tomorrow.

metasoarous commented 7 years ago

OK! This has been fixed! The data should be rebuilt on http://stoat:5554. Let me know if anything looks off, but I think that's the best recreation I'm going to be able to do for you.

You'll notice that the git commit for the parsimony data build is different. In your snapshot it is 38cd9a for the parsimony build, but 0610b4 for the dnaml build. Also note that both snapshots show cftweb running off this latter commit. Looking back at the parsimony build's git status, you'll also see that there were some pending changes to the SConstruct. I believe this was actually the fix that became 0610b4.

You may also note that the cftweb side of that git status output will look a little different now (different commit hashes, etc.). This is because we've pulled cftweb out into its own repository, separate from the cft repository that actually builds the data. This is the way things will work moving forward, so this is what you'll want to reference.

For the record, I'm really hoping that all of this will be much saner very soon as a result of the work we've been doing towards our milestone :-)

Please let me know if you have any questions.

lauradoepker commented 7 years ago

I'm reopening this issue because I need some information from @psathyrella and @metasoarous on these old "Lauren" trees:

See the screenshots above for the versions of CFT and partis. I believe partis was at v9 at the time; I'm not sure about the CFT version.

Current pipeline from deep sequencing results to CFT output:

1) downsampling: @psathyrella, to what degree? i.e., how were the data downsampled for v9?
2) partis with seed: @psathyrella, I'm pretty sure new allele inference was turned OFF in v9? Any other parameters I should know about?
3) healthy filtering: @metasoarous, how many stop codons were allowed? Did we have an in-frame filter? Did we use indel-reversed sequence sets?
4) FastTree
5) pruning/trimming: @metasoarous, how many leaves did we prune to? 100? Also, this was the old pruning script that has since been fixed up by @matsen and @WSDeWitt. I will just keep note of this fact.
6) CFT

Thank you!

psathyrella commented 7 years ago
  1. For v9, BF520 igk was downsampled to 100k when seed partitioning, and that was it (see the sketch at the end of this comment).

Source, for posterity (these commands list the partis log files and pull the n-max-queries setting out of each):

./datascripts/run.py seed-partition --study laura-mb --extra-str=v9 --logfnames|grep --color=never /fh/fast|xargs grep -A5 'n-max-queries [^-]'
./datascripts/run.py seed-partition --study kate-qrs --extra-str=v9 --logfnames|grep --color=never /fh/fast|xargs grep -A5 'n-max-queries [^-]'
  2. New allele inference was on for v9, but it was a somewhat different version from the current one, and at the time it was non-default behavior, whereas now allele finding is turned on by default. Everything else was default, except that we merged caprisa and imgt together for a starting V germline set.
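
On the downsampling in point 1, here is a minimal stand-in for what capping the queries amounts to. This is a hypothetical sketch, not partis code: partis caps the input internally via its n-max-queries setting, and whether it truncates or randomly subsamples is a detail to check against the partis docs.

    # Hypothetical sketch: cap the number of input queries before seed
    # partitioning, shown as a simple truncation for clarity.
    def downsample(seqs, n_max_queries=100000):
        return seqs[:n_max_queries]

    # e.g. for the v9 BF520 igk sample (name is illustrative):
    # sample = downsample(all_igk_seqs, n_max_queries=100000)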
lauradoepker commented 7 years ago

Great, thanks @psathyrella. Was IgG downsampled too, or just IgK? (Not sure which meaning of "that was it" you meant :) )

psathyrella commented 7 years ago

Yes, only igk -- if memory serves, the igk samples were smaller than igh, but also had vastly larger clones, and the larger clones are typically what eats up memory.

The current version has somewhat more downsampling, because when the current version was being run I couldn't really use quoll, which has more memory.

metasoarous commented 7 years ago

This is how things were being filtered at the time:

https://github.com/matsengrp/cft/blob/41247dea8a8729750dde8364984512739f4e0bf4/bin/process_partis.py#L157-L167
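
For orientation, here is a minimal sketch of what per-sequence "healthy" filtering of this kind looks like. The names are hypothetical, not the actual process_partis.py code; the linked commit has the real criteria, including how indel-reversed sequences were handled.

    # Hypothetical sketch: keep sequences that are in frame and have no
    # premature stop codons. The real thresholds are in the linked commit.
    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def codons(seq):
        return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

    def is_healthy(seq, max_stops=0):
        in_frame = len(seq) % 3 == 0
        n_stops = sum(1 for c in codons(seq) if c in STOP_CODONS)
        return in_frame and n_stops <= max_stops

    seqs = {"good": "ATGAAACCC", "early_stop": "ATGTAACCC"}
    healthy = {name: s for name, s in seqs.items() if is_healthy(s)}
    # healthy == {"good": "ATGAAACCC"}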

You'll note that the link above takes you to a URL where you can switch out the path while still looking at the same commit. So you can see exactly which version of the trim/prune command (in bin/prune.py) was being used when that data was built. I'm fairly confident that in the commit in question, prune was doing the kind-of wrong thing, but we haven't really looked at how different the results are between the two methods. It might be interesting to test that if we feel the need to substantiate the data chosen. That said, for those trees there didn't seem to be as strong a need for really heavy downsampling, so it may be fine.

We pruned the dnaml trees to 100 leaves and the dnapars trees to 300.

metasoarous commented 7 years ago

Here's the bin/prune.py file as of the commit in question:

https://github.com/matsengrp/cft/blob/lauras-first-almost-immortal-trees/bin/prune.py

As suspected, this is the version where we were just sorting nodes by distance from the seed lineage.
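
For reference, a minimal sketch of that strategy using ete3. The names here are mine, not the actual bin/prune.py code, and for simplicity it sorts leaves by plain branch-length distance to the seed rather than distance from the seed lineage:

    # Hypothetical sketch: keep the n_keep leaves closest to the seed
    # (100 for the dnaml trees, 300 for dnapars, per the note above).
    from ete3 import Tree

    def prune_to_closest(tree, seed_id, n_keep=100):
        seed = tree.search_nodes(name=seed_id)[0]
        leaves = sorted(tree.get_leaves(),
                        key=lambda leaf: seed.get_distance(leaf))
        tree.prune(leaves[:n_keep], preserve_branch_length=True)
        return tree

    t = Tree("((seed:0.1,a:0.2):0.05,(b:0.4,c:0.5):0.3);")
    prune_to_closest(t, "seed", n_keep=3)  # keeps seed, a, b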

metasoarous commented 7 years ago

I'm going to close this for now; let me know if more information is needed.