Fully utilize sample names

reedacartwright commented 6 years ago

Below are some places to change to improve the usability of Treecall.

[x] Don't write two different versions of the same tree. Just write the one with sample names. Delete this: https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/tree_est.py#L99
[x] Drop the names part so that the file extension is just .tree or .tre if you prefer that. https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/tree_est.py#L105
[x] Branch lengths don't matter to Treecall. So use format=9 when converting to Newick. That way users don't think that the lengths matter. https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/tree_est.py#L105
[x] Find the best scoring/lowest tree and write that to a .best.tree file. https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/tree_est.py#L110
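A minimal sketch of the last two items, assuming (as the wording "best scoring/lowest" suggests) that the lowest score is the best one. The helper names, scores, and trees are made up for illustration, and the regular expression is only a stand-in for a topology-only Newick export such as format=9, not Treecall's actual code:

```python
import re


def strip_branch_lengths(newick):
    """Drop ':<number>' branch lengths from a Newick string."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


def best_tree(scored_trees):
    """Return the Newick string with the lowest score (hypothetical helper)."""
    score, newick = min(scored_trees, key=lambda pair: pair[0])
    return newick


# Made-up (score, newick) pairs; real scores would come from tree estimation.
scored = [
    (-1250.4, "((ERS213579:0.1,ERS213580:0.2):0.05,ERS213581:0.3);"),
    (-1302.9, "((ERS213579:0.1,ERS213581:0.3):0.05,ERS213580:0.2);"),
]
print(strip_branch_lengths(best_tree(scored)))
```

The printed string is what would go into the .best.tree file: sample names and topology only.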

[ ] Fix annot function to use sample names. Right now it doesn't work at all, and throws this error:

Namespace(func=<function annotate_main at 0x7f67dd4a8ed8>, gtcall='M2.1.gtcall', output='M2.1.annot.tre', tree='M2.1names.tre')
Traceback (most recent call last):
  File "../treecall.py", line 382, in <module>
    args.func(args)
  File "../treecall.py", line 192, in annotate_main
    tree = init_tree(tree)
  File "/home/reed/Projects/treecall_v2/utils.py", line 102, in init_tree
    tree.leaf_order = map(int, tree.get_leaf_names())
ValueError: invalid literal for int() with base 10: 'ERS213579'

https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/treecall.py#L188-L198
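The error comes from calling int() on a leaf label such as 'ERS213579'. One possible way around it, sketched under the assumption that a leaf id is meant to be the sample's position in the VCF sample list. The helper leaf_order_from_names and every sample name except ERS213579 are hypothetical, not part of Treecall:

```python
def leaf_order_from_names(leaf_names, vcf_samples):
    """Map each leaf label to its 0-based position in the VCF sample list.

    Hypothetical helper. Labels that are already integer strings (for
    example in randomly generated starting trees) are passed through int().
    """
    index = {name: i for i, name in enumerate(vcf_samples)}
    order = []
    for name in leaf_names:
        if name in index:
            order.append(index[name])
        else:
            # Still raises ValueError for a label that is neither a known
            # sample name nor an integer.
            order.append(int(name))
    return order


vcf_samples = ["ERS213579", "ERS213580", "ERS213581"]  # made-up sample list
print(leaf_order_from_names(["ERS213581", "ERS213579", "1"], vcf_samples))
```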

[x] Remove the +1 https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/treecall.py#L197

Additional change

[x] Add the header to gtype table output?

rachelss commented 6 years ago

Why do we think the +1 was there? There must have been some reason...

@reedacartwright Please check whether the annotation function works. The gtcall file should have nodes numbered as a list of descendant leaves (based on the VCF file); the tree should have leaves numbered based on the VCF file and nodes numbered based on their children, so those should match now.

reedacartwright commented 6 years ago

I imagine they wanted to avoid zeros for some reason.

Things I spotted.

[ ] gtcall file should use sample names not numbers to be consistent with other parts of the output.
[ ] annot command position arguments should be gtcall> to be consistent with gtype command order.
[x] annot command should output a tree using sample names, not ids.

reedacartwright commented 6 years ago

What if utils::init_tree was modified to do the sample names->id conversion? It could then create a node.samples variable to keep track of names alongside ids. That may make it easier to write the genotyping and annotation functions.

rachelss commented 6 years ago

If this works I'd rather not bother with gtcall using names. That's what I did initially, but then it requires converting back.

I'm not sure what you're asking with the command position arguments.

init_tree is used elsewhere without having real sample names attached (i.e. for random starting trees), so while it would be nice not to copy-paste the conversion and go back and forth, that should really be a separate function. I don't feel like bothering right now - mostly just want it to work.

reedacartwright commented 6 years ago

As a user of treecall, I'm now spotting things that are barriers for users. I'd fix some of these myself, but my Python-fu is weak.

Using sample indexes in the gtcall files makes it harder for users to interpret the results if the trees are using sample names.
- Also, if I provide a tree to annotate using sample names, why is it returning one with labels as sample ids?
- The usage lines are also inconsistent:
  - treecall.py nbjoin [-h] [-m INT] [-e INT] [-v INT] <vcf> output
  - treecall.py gtype [-h] -t FILE [-n INT] [-m INT] [-e INT] <vcf> <output>
  - treecall.py annot [-h] -t FILE <gtcall> <outnwk> <vcf>

Why is the VCF at the end of annot, instead of right after -t FILE, which one would expect based on the other two commands? Why is the output file not at the end like it is in the other two?
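For illustration, one way to keep the positional order identical across subcommands with argparse: the VCF always first, the output always last. This is a sketch, not Treecall's parser; the -m, -e, -n, and -v options are left out, and the file names M2.1.vcf and M2.1.tre are made up:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="treecall.py")
    sub = parser.add_subparsers(dest="command")

    def add_io(p, extra=()):
        # Every subcommand takes <vcf> first and <output> last;
        # any extra positionals go in between.
        p.add_argument("vcf", metavar="<vcf>")
        for name in extra:
            p.add_argument(name, metavar="<%s>" % name)
        p.add_argument("output", metavar="<output>")

    nbjoin = sub.add_parser("nbjoin")
    add_io(nbjoin)

    gtype = sub.add_parser("gtype")
    gtype.add_argument("-t", dest="tree", metavar="FILE", required=True)
    add_io(gtype)

    annot = sub.add_parser("annot")
    annot.add_argument("-t", dest="tree", metavar="FILE", required=True)
    add_io(annot, extra=("gtcall",))
    return parser


args = build_parser().parse_args(
    ["annot", "-t", "M2.1.tre", "M2.1.vcf", "M2.1.gtcall", "M2.1.annot.tre"]
)
print(args.vcf, args.gtcall, args.output)
```

Sharing one helper for the positionals means a new subcommand cannot drift from the common order by accident.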

rachelss commented 6 years ago

Traversed tree incorrectly - should have tree output with sample names now.
Added a second gtcall file with leaf names - lame but easier to copy-paste than convert.
Does the usage look better?

reedacartwright commented 6 years ago

The usage looks better. Any reason why the 'output' for nbjoin does not have angled brackets?

I also noticed (via R) that the header for the gtcall is shorter than the body, which I think means a column label is missing.
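A quick way to confirm a missing column label is to compare the number of header fields with the number of fields in each row. The sketch below assumes the gtcall table is tab-delimited, which this thread does not state; the column labels and values are made up:

```python
import csv
import io


def header_mismatches(handle, delimiter="\t"):
    """Return (line_number, n_fields) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]


# Made-up table: three header labels, four fields per row.
example = io.StringIO("chrom\tpos\tref\n1\t100\tA\tAA\n1\t200\tC\tCT\n")
print(header_mismatches(example))
```

An empty list means every row matches the header width.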

Creating a second output file just adds to the confusion, because then the user doesn't know which one to submit to the annotation call. We need to fix annotation to use sample names instead of sample ids.

The way to do that is to put something like this after the init_tree function to convert the sid vectors back into sample name vectors.

https://github.com/rachelss/treecall/blob/c99aa6f892d90d254eacafea3ab54061955abc1b/geno.py#L182

But it might be simpler not to use init_tree at all, since it forces the use of sample ids, which are only needed when calculating probabilities.
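A rough sketch of the conversion described above: walk the tree and store a node.samples list of names next to each node's sid vector. The Node class is a stand-in for whatever tree objects Treecall actually uses, attach_sample_names is a hypothetical helper, and none of this is the code at the linked line:

```python
class Node(object):
    """Stand-in for a tree node (assumption, not Treecall's tree class)."""

    def __init__(self, sid=None, children=()):
        self.sid = list(sid) if sid is not None else []
        self.children = list(children)


def attach_sample_names(node, vcf_samples):
    """Post-order walk that fills node.sid and node.samples for every node."""
    for child in node.children:
        attach_sample_names(child, vcf_samples)
    if node.children:
        # An internal node is identified by the ids of its descendant leaves.
        node.sid = sorted(i for child in node.children for i in child.sid)
    # Translate ids back into sample names, kept alongside the ids.
    node.samples = [vcf_samples[i] for i in node.sid]
    return node


vcf_samples = ["ERS213579", "ERS213580", "ERS213581"]  # made-up sample list
root = Node(children=[Node(sid=[2]), Node(children=[Node(sid=[0]), Node(sid=[1])])])
attach_sample_names(root, vcf_samples)
print(root.sid, root.samples)
```

Keeping both attributes lets the probability code keep using ids while any user-facing output reads node.samples.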

rachelss commented 6 years ago

Does it work as-is? We can fix aesthetics, but I would like to know what the results are.

reedacartwright commented 6 years ago

Yes.

Running genotype using different trees, but same topology, produces identical results.
Annotate calculates branch lengths correctly.

I'm now trying to figure out if the v2 trees are an improvement on the Mouse data compared to the v1 trees.

rachelss / treecall

Fully utilize sample names #8