Closed lauradoepker closed 7 years ago
Any progress on saving the database and other methods information for these trees? The data look amazing :)
Oops... typed this response last week but never sent.
Can you see /home/matsengrp/working/csmall/cft/output/builds? That's where the datasets are saved (all the data needed to run them). I haven't gotten pairings of the data and cftweb bundled together, but the versions are there in your snapshot, so it shouldn't be too difficult to get back there.
Yes, I have access, but I don't see a dataset or even a directory for 2017-03-22, which is referenced in the snapshots above. More importantly, are we able to reconstruct these trees on a server? We need them for the project, since they formed the basis of our antibody choices. Let me know if/how I can help.
Ah... I didn't realize this was a snapshot that you had saved some time back. That isn't from another instance running somewhere is it?
Unfortunately, I don't have that data build any more :-/ I can try to go back to that exact commit and rerun things if you like? Assuming the input data haven't changed, we should get the same results. How should I prioritize doing this vs our June milestone?
Please prioritize this ahead of the June commit - it's critical that we nail down the methods for these (AWESOME) trees. Given that you don't have the data build anymore, we need to see ASAP if we get exactly the same trees as before or not. Yay publishing!
I've made some progress on this. Check out http://stoat:5554. There you can see dnaml data built from the git commit referenced in your snapshots. That should look like what you saw before; if anything is different, please let me know. As for the dnapars build, I'm not sure why it isn't loading at present, but I'll look into that tomorrow.
OK! This has been fixed! The data should be rebuilt on http://stoat:5554. Let me know if anything looks off, but I think that's the best recreation I'm going to be able to do for you.
You'll notice that the git commit for the parsimony data build is different. In your snapshot it is 38cd9a for the parsimony build, but 0610b4 for the dnaml build. Also note that both snapshots show cftweb was running off of this latter commit. Looking back at the parsimony build git status, you'll also see that there were some pending changes to the SConstruct. I believe this was actually the fix that became 0610b4.
You may also note that the cftweb side of those git status screenshots will look a little different now (different commit hashes, etc.). This is because we've pulled cftweb out as its own repository, separate from the cft repository that actually builds the data. This is the way things will be moving forward, so this is what you'll want to reference.
For the record, I'm really hoping that all of this will be much saner very soon as a result of the work we've been doing towards our milestone :-)
Please let me know if you have any questions.
I'm reopening this issue because I need some information from @psathyrella and @metasoarous on these old "Lauren" trees:
Please look above to find the screenshots that mention the versions of CFT and partis. I believe partis was v9 at the time. I'm not sure about CFT versions.
Current pipeline from deep sequence results to CFT output:

1) downsampling = @psathyrella, to what degree? i.e. how were the data downsampled for v9?
2) partis with seed = @psathyrella, I'm pretty sure new allele inference was turned OFF in v9? Any other parameters I should know about?
3) healthy filtering = @metasoarous, how many stop codons were allowed? Did we have an in-frame filter? Did we use indel-reversed sequence sets?
4) FastTree
5) pruning/trimming = @metasoarous, how many leaves did we prune to? 100? Also, this was the old pruning script that has since been fixed up by @matsen and @WSDeWitt. I will just keep note of this fact.
6) CFT
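To make step 3 concrete, here's a minimal, hypothetical sketch of what a "healthy sequence" filter like this could look like (in-frame length check plus a stop-codon threshold). This is an illustration only, not the actual CFT filtering code; the real thresholds and settings are exactly what's being asked above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_healthy(seq, max_stops=0, require_in_frame=True):
    """Hypothetical filter: reject sequences whose length isn't a
    multiple of 3 (out of frame) or that contain more than max_stops
    stop codons in reading frame 0."""
    seq = seq.upper()
    if require_in_frame and len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    n_stops = sum(codon in STOP_CODONS for codon in codons)
    return n_stops <= max_stops

# Made-up examples:
print(is_healthy("ATGAAACCC"))  # True: in frame, no stops
print(is_healthy("ATGTAACCC"))  # False: contains TAA
print(is_healthy("ATGAAACC"))   # False: length not a multiple of 3
```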
Thank you!
Source, for posterity:
```sh
./datascripts/run.py seed-partition --study laura-mb --extra-str=v9 --logfnames | grep --color=never /fh/fast | xargs grep -A5 'n-max-queries [^-]'
./datascripts/run.py seed-partition --study kate-qrs --extra-str=v9 --logfnames | grep --color=never /fh/fast | xargs grep -A5 'n-max-queries [^-]'
```
Great, thanks @psathyrella . Was IgG downsampled too? Or just IgK? (Not sure which meaning of "that was it" you mean :) )
Yes, only IgK -- if memory serves, the IgK samples were smaller than the IgH ones, but also had vastly larger clones, and the larger clones are typically what eats up memory.
The current version has somewhat more downsampling, because when the current version was being run I couldn't really use quoll, which has more memory.
This is how things were being filtered at the time:
You'll note that when you click the link above, it takes you to a URL where you can swap out the path while staying on the same commit. So you can see exactly which version of the trim/prune command (in bin/prune.py) was being used at the time that data was built. I'm fairly confident that in the commit in question, prune was doing the kind-of wrong thing. But we haven't really looked at how different the results from the two methods are. It might be interesting to test that if we feel the need to substantiate the data chosen. That said, for those trees there didn't seem to be as strong a need for the really heavy downsampling, so it may be fine.
We pruned the dnaml to 100 and the dnapars trees to 300.
Here's the bin/prune.py file as of the commit in question:
https://github.com/matsengrp/cft/blob/lauras-first-almost-immortal-trees/bin/prune.py
As suspected, this is the version where we were just sorting nodes by distance from seed lineage.
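As a minimal sketch, the "sort by distance from seed lineage" approach amounts to ranking leaves by that distance and keeping the closest N. This is an illustration with made-up names and distances, not the actual prune.py code:

```python
def prune_by_seed_distance(leaf_distances, n_keep):
    """Keep the n_keep leaves closest to the seed lineage.

    leaf_distances: dict mapping leaf name -> distance from the seed
    lineage (e.g. patristic distance on the tree).
    """
    ranked = sorted(leaf_distances, key=leaf_distances.get)
    return set(ranked[:n_keep])

# Hypothetical example: keep the 2 closest of 4 leaves.
distances = {"seq1": 0.01, "seq2": 0.30, "seq3": 0.05, "seq4": 0.22}
kept = prune_by_seed_distance(distances, 2)
# kept == {"seq1", "seq3"}
```

For the actual builds, n_keep would have been 100 for the dnaml trees and 300 for the dnapars trees, per the comment above.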
I'm going to close for now unless there's more information needed.
@metasoarous: We just got some GREAT data using some of the laura-mb trees. We have recorded all of the information (commits, etc.) from the webpages, but we'd like to download the dataset so we can make sure to recreate the trees for future publications.
The trees were accessed on March 30, 2017. Here are the webpage snapshots: