rachelss / treecall

Tree-based joint lineage inference and somatic mutation calling
GNU General Public License v2.0

Fully utilize sample names #8

Open reedacartwright opened 6 years ago

reedacartwright commented 6 years ago

Below are some places to change to improve the usability of Treecall.

https://github.com/rachelss/treecall/blob/38778cc1001454a37b5087f05efa1e2ebfc3468a/treecall.py#L188-L198

Additional change

rachelss commented 6 years ago

Why do we think the +1 was there? There must have been some reason...

@reedacartwright Please check whether the annotation function works. The gtcall file should have nodes numbered as a list of descendant leaves (based on the vcf file); the tree should be numbered with leaves based on the vcf file and nodes numbered based on their children, so those should match now.

reedacartwright commented 6 years ago

I imagine they wanted to avoid zeros for some reason.

Things I spotted.

reedacartwright commented 6 years ago

What if utils::init_tree was modified to do the sample names->id conversion? It could then create a node.samples variable to keep track of names alongside ids. That may make it easier to write the genotyping and annotation functions.
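A minimal sketch of that idea, assuming a simplified stand-in `Node` class rather than treecall's real tree objects: one post-order pass records both the integer ids and the matching sample names on every node. The `annotate_samples` helper, the `sid` and `samples` attribute layout, and the sample names are all hypothetical.

```python
class Node:
    """Minimal stand-in for a tree node; NOT treecall's actual class."""

    def __init__(self, name=None, children=()):
        self.name = name                # leaf label (sample name), None for internal nodes
        self.children = list(children)
        self.sid = []                   # sorted integer sample ids below this node
        self.samples = []               # matching sample names, same order as sid


def annotate_samples(node, name_to_id):
    """Post-order pass that fills node.sid and node.samples (hypothetical helper)."""
    if not node.children:
        pairs = [(name_to_id[node.name], node.name)]
    else:
        pairs = []
        for child in node.children:
            annotate_samples(child, name_to_id)
            pairs.extend(zip(child.sid, child.samples))
        pairs.sort()                    # keep ids and names aligned and ordered
    node.sid = [i for i, _ in pairs]
    node.samples = [n for _, n in pairs]
    return node


# Hypothetical sample order, as it might appear in a VCF header.
vcf_samples = ["M1", "M2", "M3"]
name_to_id = {name: i for i, name in enumerate(vcf_samples)}

tree = Node(children=[Node("M3"), Node(children=[Node("M1"), Node("M2")])])
annotate_samples(tree, name_to_id)
```

With both attributes kept side by side, genotyping code could keep using `node.sid` for indexing while output code reads `node.samples`.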

rachelss commented 6 years ago

If this works I'd rather not bother with gtcall using names. That's what I did initially, but then it requires converting back. I'm not sure what you're asking with the command position arguments.

init_tree is used elsewhere without real sample names attached (i.e. for random starting trees), so while it would be nice not to copy-paste the conversion and go back and forth, that should really be a separate function. I don't feel like bothering right now; I mostly just want it to work.

reedacartwright commented 6 years ago

As a user of treecall, I'm now spotting things that are barriers for users. I'd fix some of these myself, but my Python-fu is weak.

rachelss commented 6 years ago

Traversed tree incorrectly - should have tree output with sample names now.

Added a second gtcall file with leaf names - lame but easier to copy-paste than convert.

Does the usage look better?

reedacartwright commented 6 years ago

The usage looks better. Any reason why the 'output' for nbjoin does not have angle brackets?

I also noticed (via R) that the header for the gtcall is shorter than the body, which I think means a column label is missing.
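The header/body mismatch can also be checked without R. The sketch below assumes the gtcall file is tab-delimited with a single header line, which is a guess about the format; `column_mismatches` is a hypothetical helper and the example rows are made up.

```python
import csv
import io


def column_mismatches(handle, delimiter="\t"):
    """Return the header width and the (line number, width) of rows that differ from it."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    bad = [
        (lineno, len(row))
        for lineno, row in enumerate(reader, start=2)  # line 1 is the header
        if len(row) != len(header)
    ]
    return len(header), bad


# Made-up content: three header labels but four fields in the body row.
example = io.StringIO("chrom\tpos\tM1\nchr1\t100\t0/1\t0/0\n")
n_header, bad = column_mismatches(example)
```

For a real file, pass an open file handle instead of the `io.StringIO` object; any entry in `bad` points at a row whose width disagrees with the header.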

Creating a second output file just adds to the confusion, because then the user doesn't know which one to submit to the annotation call. We need to fix annotation to use sample names instead of sample ids.

The way to do that is to put something like this after the init_tree function to convert the sid vectors back into sample name vectors.

https://github.com/rachelss/treecall/blob/c99aa6f892d90d254eacafea3ab54061955abc1b/geno.py#L182

But it might be simpler not to use init_tree at all, since it forces the use of sample ids, which are only needed when calculating probabilities.
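For the conversion of sid vectors back into sample names, a rough sketch might look like the following. It assumes each node carries a list of integer ids that index into the VCF sample order; `sids_to_names`, the node labels, and the `offset` parameter (set it to 1 if the ids turn out to be 1-based) are hypothetical and not part of treecall.

```python
def sids_to_names(sids, sample_names, offset=0):
    """Map integer sample ids to sample names (hypothetical helper)."""
    return [sample_names[i - offset] for i in sids]


# Hypothetical sample order, as it might appear in a VCF header.
vcf_samples = ["M1", "M2", "M3", "M4"]

# Made-up sid vectors for two internal nodes.
node_sids = {"n5": [0, 1], "n6": [0, 1, 2]}

node_names = {
    label: sids_to_names(sids, vcf_samples)
    for label, sids in node_sids.items()
}
```

Keeping the conversion in one small function avoids copy-pasting it into both the genotyping and annotation code paths.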

rachelss commented 6 years ago

Does it work as-is? We can fix aesthetics, but I would like to know what the results are.

reedacartwright commented 6 years ago

Yes.

I'm now trying to figure out whether the v2 trees are an improvement over the v1 trees on the Mouse data.